[ovs-dev,PATCHv13] netdev-afxdp: add new netdev type for AF_XDP.

Message ID 1560973919-50953-1-git-send-email-u9012063@gmail.com
State Changes Requested

Commit Message

William Tu June 19, 2019, 7:51 p.m. UTC
The patch introduces experimental AF_XDP support for OVS netdev.
AF_XDP, the Address Family of the eXpress Data Path, is a new Linux socket
type built upon the eBPF and XDP technology.  It aims to have comparable
performance to DPDK but cooperate better with the existing kernel networking
stack.  An AF_XDP socket receives and sends packets from an eBPF/XDP program
attached to the netdev, bypassing a couple of the Linux kernel's subsystems.
As a result, an AF_XDP socket shows much better performance than AF_PACKET.
For more details about AF_XDP, please see the Linux kernel's
Documentation/networking/af_xdp.rst.  Note that by default, this feature is
not compiled in.

Signed-off-by: William Tu <u9012063@gmail.com>
---
v1->v2:
- add a list to maintain unused umem elements
- remove copy from rx umem to ovs internal buffer
- use hugetlb to reduce misses (not much difference)
- use pmd mode netdev in OVS (huge performance improvement)
- remove malloc dp_packet, instead put dp_packet in umem

v2->v3:
- rebase on the OVS master, 7ab4b0653784
  ("configure: Check for more specific function to pull in pthread library.")
- remove the dependency on libbpf and dpif-bpf.
  instead, use the built-in XDP_ATTACH feature.
- data structure optimizations for better performance, see[1]
- more test cases support
v3: https://mail.openvswitch.org/pipermail/ovs-dev/2018-November/354179.html

v3->v4:
- Use AF_XDP API provided by libbpf
- Remove the dependency on XDP_ATTACH kernel patch set
- Add documentation, bpf.rst

v4->v5:
- rebase to master
- remove rfc, squash all into a single patch
- add --enable-afxdp, so by default, AF_XDP is not compiled
- add options: xdpmode=drv,skb
- add multiple queue and multiple PMD support, with options: n_rxq
- improve documentation, rename bpf.rst to af_xdp.rst

v5->v6
- rebase to master, commit 0cdd5b13de91b98
- address errors from sparse and clang
- pass travis-ci test
- address feedback from Ben
- fix issues reported by 0-day robot
- improved documentation

v6-v7
- rebase to master, commit abf11558c1515bf3b1
- address feedback from Ilya, Ben, and Eelco, see:
  https://www.mail-archive.com/ovs-dev@openvswitch.org/msg32357.html
- add XDP mode change, implement get/set_config, reconfigure
- Fix reconfiguration/crash issue caused by libbpf, see patch:
  [PATCH bpf 0/2] libbpf: fixes for AF_XDP teardown
- perf optimization for batching umem_push/pop
- perf optimization for batching kick_tx
- test build with dpdk
- fix/refactor atomic operation
- make AF_XDP x86 specific, otherwise fail at build time
- lots of code refactoring
- add PVP setup in documentation

v7-v8:
- Address feedback from Ilya at:
  https://patchwork.ozlabs.org/patch/1095019/
- add netdev-linux-private.h
- fix afxdp reconfigure issue
- sort include headers
- remove unnecessary OVS_UNUSED
- coding style fixes
- error case handling and memory leak

v8-v9:
- rebase to master 180bbbed3a3867d52
- Address review feedback from Ben, Ilya and Eelco, at:
  https://patchwork.ozlabs.org/patch/1097740/
- == From Ilya ==
- Optimize the reconfiguration logic
- Implement .rxq_recv and .send for afxdp
- Remove system-afxdp-traffic.at, reuse existing code
- Use Ilya's rdtsc code
- remove --disable-system
- == From Eelco ==
- Fix bug when removing br0, util(revalidator49)|EMER|lib/poll-loop.c:111:
  assertion !fd != !wevent failed
- Fix bug and use default value from libbpf, ex: XSK_RING_PROD__DEFAULT...
- Clear xdp program when receiving a signal, e.g. ctrl+c
- Add options to vswitch.xml, set xdpmode default to skb-mode
- No support for ARM and PPC, now x86_64 only
- remove redundant header includes and function/macro definitions
- remove some ifdef HAVE_AF_XDP
- == From others/both about afxdp rx and tx ==
- Several umem push/pop error handling improvement/fixes
- add lock to address concurrent_txq case
- improve error handling
- add stats
- Things that are not done yet
- MTU limitation
- n_txq_desc/n_rxq_desc option.

v9-v10
- remove x86_64 limitation, suggested by Ben and Eelco
- add xmalloc_pagealign, free_pagealign
- minor refactor

v10-v11
- address feedback from Ilya at
  https://patchwork.ozlabs.org/patch/1106495/
- fix typos, and some refactoring
- refactor existing code and introduce xmalloc pagealign
- fix a couple of error handling cases
- allocate per-txq lock
- dynamically allocate xsk array
- fix cycle_counter_update() for non-x86/non-linux case

v11-v12
- mainly address a couple of crashes reported by Eelco
  https://patchwork.ozlabs.org/patch/1110729/
- fix xdp program cleanup problem when ovs-vswitchd restarts
- the following cases should remove the xdp program
  - kill `pidof ovs-vswitchd`
  - ovs-appctl -t ovs-vswitchd exit --cleanup
  - note: ovs-ctl restart does not have "--cleanup" so still an issue
- work around issues of xsk_ring_cons__peek at libbpf, reported at
  https://marc.info/?l=xdp-newbies&m=156055471727857&w=2
- variable name refactoring
- there is some performance degradation, but let's make sure
  everything works first

v12-v13
- rebase to master
- add coverage counters afxdp_cq_empty, afxdp_fq_full
- minor refactoring
---
 Documentation/automake.mk             |   1 +
 Documentation/index.rst               |   1 +
 Documentation/intro/install/afxdp.rst | 425 ++++++++++++++++
 Documentation/intro/install/index.rst |   1 +
 acinclude.m4                          |  35 ++
 configure.ac                          |   1 +
 lib/automake.mk                       |  14 +
 lib/dp-packet.c                       |  28 ++
 lib/dp-packet.h                       |  18 +-
 lib/dpif-netdev-perf.h                |  26 +
 lib/netdev-afxdp.c                    | 891 ++++++++++++++++++++++++++++++++++
 lib/netdev-afxdp.h                    |  74 +++
 lib/netdev-linux-private.h            | 138 ++++++
 lib/netdev-linux.c                    | 121 ++---
 lib/netdev-provider.h                 |   3 +
 lib/netdev.c                          |  11 +
 lib/spinlock.h                        |  70 +++
 lib/util.c                            |  92 +++-
 lib/util.h                            |   5 +
 lib/xdpsock.c                         | 170 +++++++
 lib/xdpsock.h                         | 101 ++++
 tests/automake.mk                     |  16 +
 tests/system-afxdp-macros.at          |  20 +
 tests/system-afxdp-testsuite.at       |  26 +
 vswitchd/vswitch.xml                  |  30 ++
 25 files changed, 2210 insertions(+), 108 deletions(-)
 create mode 100644 Documentation/intro/install/afxdp.rst
 create mode 100644 lib/netdev-afxdp.c
 create mode 100644 lib/netdev-afxdp.h
 create mode 100644 lib/netdev-linux-private.h
 create mode 100644 lib/spinlock.h
 create mode 100644 lib/xdpsock.c
 create mode 100644 lib/xdpsock.h
 create mode 100644 tests/system-afxdp-macros.at
 create mode 100644 tests/system-afxdp-testsuite.at

Comments

Ilya Maximets June 21, 2019, 2:24 p.m. UTC | #1
On 19.06.2019 22:51, William Tu wrote:
> The patch introduces experimental AF_XDP support for OVS netdev.
> AF_XDP, the Address Family of the eXpress Data Path, is a new Linux socket
> type built upon the eBPF and XDP technology.  It aims to have comparable
> performance to DPDK but cooperate better with the existing kernel networking
> stack.  An AF_XDP socket receives and sends packets from an eBPF/XDP program
> attached to the netdev, bypassing a couple of the Linux kernel's subsystems.
> As a result, an AF_XDP socket shows much better performance than AF_PACKET.
> For more details about AF_XDP, please see the Linux kernel's
> Documentation/networking/af_xdp.rst.  Note that by default, this feature is
> not compiled in.
> 
> Signed-off-by: William Tu <u9012063@gmail.com>
> ---


Hi!
I finally managed to successfully run 'make check-afxdp' with the
following results:

  ERROR: 76 tests were run,
  6 failed unexpectedly.
  48 tests were skipped.

Failed tests are IP fragmentation expiry for conntrack (I'll send a
separate e-mail about this issue) and NSH tests which are broken for
a while now (not related to XDP).

However, here is the list of issues I faced and had to fix or work around
to make it work:

1. Abort while trying to push any header:

  Thread 10 "pmd8" received signal SIGABRT, Aborted.
  [Switching to Thread 0x7ff3ab9d9700 (LWP 12151)]
  0x00007ff3b311793f in raise () from /lib64/libc.so.6
  (gdb) bt
  #0  0x00007ff3b311793f in raise () from /lib64/libc.so.6
  #1  0x00007ff3b3101c95 in abort () from /lib64/libc.so.6
  #2  0x0000000000683930 in dp_packet_resize__ (b=0x7ff3a9d346b0, new_headroom=64, new_tailroom=<out>) at ./lib/dp-packet.h:613
  #3  0x000000000068547c in dp_packet_prealloc_headroom (b=<out>, size=4) at lib/dp-packet.c:315
  #4  dp_packet_push_uninit (b=<out>, size=4) at lib/dp-packet.c:427
  #5  dp_packet_resize_l2_5 (b=0x7ff3a9d346b0, increment=4) at lib/dp-packet.c:493
  #6  0x000000000089a841 in push_mpls (packet=0x2, ethtype=<>, lse=1076953088) at lib/packets.c:391
  #7  0x0000000000762b5d in odp_execute_actions  at lib/odp-execute.c:875
  #8  0x00000000006a4493 in dp_netdev_execute_actions  at lib/dpif-netdev.c:7264
  #9  handle_packet_upcall  at lib/dpif-netdev.c:6545
  #10 fast_path_processing  at lib/dpif-netdev.c:6641
  #11 0x00000000006a2b6f in dp_netdev_input__  at lib/dpif-netdev.c:6729
  #12 0x000000000069f973 in dp_netdev_input  at lib/dpif-netdev.c:6767
  #13 dp_netdev_process_rxq_port  at lib/dpif-netdev.c:4277
  #14 0x000000000069ba0b in pmd_thread_main  at lib/dpif-netdev.c:5451
  #15 0x000000000085f6b0 in ovsthread_wrapper  at lib/ovs-thread.c:352
  #16 0x00007ff3b3e532de in start_thread () from /lib64/libpthread.so.0
  #17 0x00007ff3b31dca63 in clone () from /lib64/libc.so.6

Reason: Wrong headroom management. In the current implementation the headroom
of the dp-packet is always zero, so any attempt to resize the packet ends up
in OVS_NOT_REACHED().

Here is the patch that could solve the issue:

diff --git a/lib/dp-packet.c b/lib/dp-packet.c
index e6a794707..c9593515a 100644
--- a/lib/dp-packet.c
+++ b/lib/dp-packet.c
@@ -65,10 +65,11 @@ dp_packet_use(struct dp_packet *b, void *base, size_t allocated)
  * memory starting at AF_XDP umem base.
  */
 void
-dp_packet_use_afxdp(struct dp_packet *b, void *base, size_t allocated)
+dp_packet_use_afxdp(struct dp_packet *b, void *data,
+                    size_t allocated, size_t headroom)
 {
-    dp_packet_set_base(b, base);
-    dp_packet_set_data(b, base);
+    dp_packet_set_base(b, (char *) data - headroom);
+    dp_packet_set_data(b, data);
     dp_packet_set_size(b, 0);
 
     dp_packet_set_allocated(b, allocated);
diff --git a/lib/dp-packet.h b/lib/dp-packet.h
index e3438226e..47ea14b94 100644
--- a/lib/dp-packet.h
+++ b/lib/dp-packet.h
@@ -132,7 +132,7 @@ void dp_packet_use(struct dp_packet *, void *, size_t);
 void dp_packet_use_stub(struct dp_packet *, void *, size_t);
 void dp_packet_use_const(struct dp_packet *, const void *, size_t);
 #if HAVE_AF_XDP
-void dp_packet_use_afxdp(struct dp_packet *, void *, size_t);
+void dp_packet_use_afxdp(struct dp_packet *, void *, size_t, size_t);
 #endif
 void dp_packet_init_dpdk(struct dp_packet *);
 
diff --git a/lib/netdev-afxdp.c b/lib/netdev-afxdp.c
index 33d861215..518389a58 100644
--- a/lib/netdev-afxdp.c
+++ b/lib/netdev-afxdp.c
@@ -591,7 +591,7 @@ netdev_afxdp_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
         packet = &xpacket->packet;
 
         /* Initialize the struct dp_packet */
-        dp_packet_use_afxdp(packet, pkt, FRAME_SIZE - FRAME_HEADROOM);
+        dp_packet_use_afxdp(packet, pkt, FRAME_SIZE, FRAME_HEADROOM);
         dp_packet_set_size(packet, len);
 
         /* Add packet into batch, increase batch->count */
@@ -646,7 +646,7 @@ free_afxdp_buf(struct dp_packet *p)
     if (xpacket->mpool) {
         void *base = dp_packet_base(p);
 
-        addr = (uintptr_t)base & (~FRAME_SHIFT_MASK);
+        addr = ((uintptr_t)base + FRAME_HEADROOM) & (~FRAME_SHIFT_MASK);
         umem_elem_push(xpacket->mpool, (void *)addr);
     }
 }
@@ -664,7 +664,7 @@ free_afxdp_buf_batch(struct dp_packet_batch *batch)
         if (xpacket->mpool) {
             void *base = dp_packet_base(packet);
 
-            addr = (uintptr_t)base & (~FRAME_SHIFT_MASK);
+            addr = ((uintptr_t)base + FRAME_HEADROOM) & (~FRAME_SHIFT_MASK);
             elems[i] = (void *)addr;
         }
     }
---

Works for me. Please, re-check. I might have missed something.
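
The address arithmetic in the patch above can be looked at in isolation. The following is a minimal sketch, not the OVS code: the frame size and headroom values and the helper names frame_data(), packet_base() and frame_from_base() are assumptions made for this example. Only the expression (base + FRAME_HEADROOM) & ~FRAME_SHIFT_MASK is taken from the diff.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical values for illustration; the patch defines its own. */
#define FRAME_SHIFT      11
#define FRAME_SIZE       ((uintptr_t) 1 << FRAME_SHIFT)
#define FRAME_SHIFT_MASK (FRAME_SIZE - 1)
#define FRAME_HEADROOM   256

/* Address at which packet data starts inside the frame at 'frame'.
 * Assumes the data is placed right after the headroom. */
static uintptr_t
frame_data(uintptr_t frame)
{
    assert((frame & FRAME_SHIFT_MASK) == 0);   /* Frames are aligned. */
    return frame + FRAME_HEADROOM;
}

/* dp_packet base as the patched dp_packet_use_afxdp() would set it. */
static uintptr_t
packet_base(uintptr_t data, uintptr_t headroom)
{
    return data - headroom;
}

/* Recover the frame start from a packet base, as in free_afxdp_buf(). */
static uintptr_t
frame_from_base(uintptr_t base)
{
    return (base + FRAME_HEADROOM) & ~FRAME_SHIFT_MASK;
}
```

With these assumed values, a frame at 5 * FRAME_SIZE has its data at frame + 256 and a base equal to the frame start. Adding the headroom back before masking keeps the result inside the same frame even when the base itself would fall just before the frame boundary.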



2. vlan tests fail due to kernel issues. We need to disable vlan offloading
along with tx offloading, otherwise packets disappear somewhere inside the
kernel. I didn't find the root cause. In v8 of this patch you disabled 'rxvlan'
and 'txvlan' for tests, but these bits are missing in newer versions.
Most probably we only need to disable 'txvlan'.



3. cvlan tests fail due to kernel issues. I didn't find a workaround for this.
Disabling the vlan offloading doesn't work. I had to force the afxdp testsuite
to skip these tests:

diff --git a/tests/system-afxdp-macros.at b/tests/system-afxdp-macros.at
index 1e6f7a46b..91f2ef91c 100644
--- a/tests/system-afxdp-macros.at
+++ b/tests/system-afxdp-macros.at
@@ -18,3 +18,6 @@ m4_define([ADD_VETH],
       on_exit 'ip link del ovs-$1'
     ]
 )
+
+m4_define([OVS_CHECK_8021AD],
+    [AT_SKIP_IF([:])])
diff --git a/tests/system-traffic.at b/tests/system-traffic.at
index d23ee897b..22f814fba 100644
--- a/tests/system-traffic.at
+++ b/tests/system-traffic.at
@@ -71,6 +71,7 @@ AT_CLEANUP
 
 AT_SETUP([datapath - ping between two ports on cvlan])
 OVS_TRAFFIC_VSWITCHD_START()
+OVS_CHECK_8021AD()
 
 AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
 
@@ -161,6 +162,7 @@ AT_CLEANUP
 
 AT_SETUP([datapath - ping6 between two ports on cvlan])
 OVS_TRAFFIC_VSWITCHD_START()
+OVS_CHECK_8021AD()
 
 AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
 
---



4. 'system-afxdp-testsuite' doesn't re-build on 'system-userspace-macros.at'
changes. Missing file dependencies in automake:

diff --git a/tests/automake.mk b/tests/automake.mk
index 131564bb0..f0449c395 100644
--- a/tests/automake.mk
+++ b/tests/automake.mk
@@ -163,6 +163,7 @@ SYSTEM_USERSPACE_TESTSUITE_AT = \
        tests/system-userspace-packet-type-aware.at
 
 SYSTEM_AFXDP_TESTSUITE_AT = \
+       tests/system-userspace-macros.at \
        tests/system-afxdp-testsuite.at \
        tests/system-afxdp-macros.at
 
---


5. 'make check-afxdp' executes 'make install' which is unwanted:

diff --git a/tests/automake.mk b/tests/automake.mk
index 131564bb0..d6ab51732 100644
--- a/tests/automake.mk
+++ b/tests/automake.mk
@@ -325,7 +326,6 @@ check-system-userspace: all
        "$$@" $(TESTSUITEFLAGS) -j1 || (test X'$(RECHECK)' = Xyes && "$$@" --recheck)
 
 check-afxdp: all
-       $(MAKE) install
        set $(SHELL) '$(SYSTEM_AFXDP_TESTSUITE)' -C tests  AUTOTEST_PATH='$(AUTOTEST_PATH)' $(TESTSUITEFLAGS) -j1; \
        "$$@" || (test X'$(RECHECK)' = Xyes && "$$@" --recheck)
 
---


6. TCP doesn't work with XDP over veth interfaces. This is a known kernel
issue that could be worked around by the following kernel patch:
    https://github.com/cilium/cilium/issues/3077#issuecomment-430801467 .
Until a proper solution is implemented in the kernel we probably should skip
all the tests that involve TCP.


7. It's a known issue that tunneling is not working right now in system-traffic
userspace tests. It could be worked around by removing '--disable-system' from
OVS_TRAFFIC_VSWITCHD_START in tests/system-userspace-macros.at. I'm going to
prepare a patch for this issue in the near future.


8. As I said, I'll send a separate mail about the conntrack IP fragmentation issues.


P.S. We probably should mention TCP, vlan and 8021ad issues on veth interfaces
     somewhere in docs, so users will be aware of them.

Best regards, Ilya Maximets.
Ilya Maximets June 21, 2019, 2:56 p.m. UTC | #2
On 19.06.2019 22:51, William Tu wrote:
> The patch introduces experimental AF_XDP support for OVS netdev.
> AF_XDP, the Address Family of the eXpress Data Path, is a new Linux socket
> type built upon the eBPF and XDP technology.  It aims to have comparable
> performance to DPDK but cooperate better with the existing kernel networking
> stack.  An AF_XDP socket receives and sends packets from an eBPF/XDP program
> attached to the netdev, bypassing a couple of the Linux kernel's subsystems.
> As a result, an AF_XDP socket shows much better performance than AF_PACKET.
> For more details about AF_XDP, please see the Linux kernel's
> Documentation/networking/af_xdp.rst.  Note that by default, this feature is
> not compiled in.
> 
> Signed-off-by: William Tu <u9012063@gmail.com>
> ---

Hi!
This is about "conntrack - IP fragmentation expiry" tests I mentioned in a
previous mail:
    https://mail.openvswitch.org/pipermail/ovs-dev/2019-June/359971.html

There is a major bug related to memory pool management. The issue is that
we *must not* free a memory pool while packets from it are still in use by
any other code. For example, packets could be delayed for future processing,
as happens in the case of IP fragment re-assembly. We fixed the same issue for
DPDK around a year ago. In practice, we must postpone the actual freeing of
umem->buffer, umem_pool and the xpacket_pool until all packets are freed, i.e.
for as long as umemp->index != umemp->size.

You may use 'dpdk_mp_sweep' as a reference.
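
The deferred-free idea can be sketched as follows. This is only an illustration of the general technique, not the DPDK or OVS code: struct pool, pool_create(), pool_get(), pool_put(), pool_retire() and pool_sweep() are hypothetical names, the in-use counter stands in for whatever bookkeeping the umem pool really has, and the sketch is single-threaded, so a real implementation would need locking around the retired list.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical pool descriptor; names are illustrative only. */
struct pool {
    struct pool *next;      /* Link in the list of retired pools. */
    size_t in_use;          /* Packets handed out and not yet returned. */
    void *buffer;           /* Backing memory shared by all packets. */
};

static struct pool *retired;   /* Pools waiting for their last packet. */

static struct pool *
pool_create(size_t size)
{
    struct pool *p = calloc(1, sizeof *p);

    assert(p);
    p->buffer = malloc(size);
    assert(p->buffer);
    return p;
}

/* Account for a packet leaving / returning to the pool. */
static void pool_get(struct pool *p) { p->in_use++; }
static void pool_put(struct pool *p) { assert(p->in_use > 0); p->in_use--; }

/* Free every retired pool that has no outstanding packets.
 * Returns the number of pools actually freed. */
static size_t
pool_sweep(void)
{
    size_t freed = 0;
    struct pool **pp = &retired;

    while (*pp) {
        struct pool *p = *pp;

        if (p->in_use == 0) {
            *pp = p->next;          /* Unlink before freeing. */
            free(p->buffer);
            free(p);
            freed++;
        } else {
            pp = &p->next;          /* Still referenced; keep it. */
        }
    }
    return freed;
}

/* Called instead of free() when the owning netdev is destroyed. */
static void
pool_retire(struct pool *p)
{
    p->next = retired;
    retired = p;
    pool_sweep();
}
```

In this sketch the destruct path would call pool_retire() rather than freeing the buffer directly, and pool_sweep() would be called again later (for example when another pool is created or retired), so that a delayed packet such as a queued IP fragment can still be returned safely.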

Here is a crash log I have with IPv4 fragmentation test:

# make -j8 check-afxdp TESTSUITEFLAGS='-v 52'
52. system-traffic.at:2398: testing conntrack - IPv4 fragmentation expiry

=================================================================
==17056==ERROR: AddressSanitizer: heap-use-after-free on address 0x7f55ab131f34 at pc 0x0000005a150f bp 0x7ffea9a44e80 sp 0x7ffea9a44e70
READ of size 1 at 0x7f55ab131f34 thread T0
    #0 0x5a150e in dp_packet_delete lib/dp-packet.h:191
    #1 0x5a150e in ipf_destroy lib/ipf.c:1332
    #2 0x8701bd in conntrack_destroy lib/conntrack.c:393
    #3 0x55bdb4 in dp_netdev_free lib/dpif-netdev.c:1642
    #4 0x55c7a7 in dp_netdev_unref lib/dpif-netdev.c:1678
    #5 0x55c83e in dp_netdev_unref lib/dpif-netdev.c:1673
    #6 0x55c83e in dpif_netdev_close lib/dpif-netdev.c:1689
    #7 0x576e25 in dpif_uninit lib/dpif.c:1683
    #8 0x576f62 in dpif_close lib/dpif.c:453
    #9 0x470a3e in close_dpif_backer ofproto/ofproto-dpif.c:684
    #10 0x47c7ef in destruct ofproto/ofproto-dpif.c:1658
    #11 0x45b519 in ofproto_destroy ofproto/ofproto.c:1665
    #12 0x414065 in bridge_destroy vswitchd/bridge.c:3319
    #13 0x426aa1 in bridge_exit vswitchd/bridge.c:509
    #14 0x409b97 in main vswitchd/ovs-vswitchd.c:143
    #15 0x7f55b44a7812 in __libc_start_main (/lib64/libc.so.6+0x23812)
    #16 0x40c46d in _start (/root/git/ovs/vswitchd/ovs-vswitchd+0x40c46d)

0x7f55ab131f34 is located 7988 bytes inside of 4653056-byte region [0x7f55ab130000,0x7f55ab5a0000)
freed by thread T0 here:
    #0 0x7f55b5e943a0 in free (/lib64/libasan.so.5+0xef3a0)
    #1 0x89fd3e in xpacket_pool_cleanup lib/xdpsock.c:168
    #2 0x7ea714 in xsk_destroy lib/netdev-afxdp.c:298
    #3 0x7ea714 in xsk_destroy_all lib/netdev-afxdp.c:315
    #4 0x7ee5d9 in netdev_afxdp_destruct lib/netdev-afxdp.c:836
    #5 0x5e8d80 in netdev_unref lib/netdev.c:577
    #6 0x4418ea in ofport_destroy__ ofproto/ofproto.c:2539
    #7 0x45b687 in ofproto_destroy ofproto/ofproto.c:1658
    #8 0x414065 in bridge_destroy vswitchd/bridge.c:3319
    #9 0x426aa1 in bridge_exit vswitchd/bridge.c:509
    #10 0x409b97 in main vswitchd/ovs-vswitchd.c:143
    #11 0x7f55b44a7812 in __libc_start_main (/lib64/libc.so.6+0x23812)

previously allocated by thread T0 here:
    #0 0x7f55b5e95580 in posix_memalign (/lib64/libasan.so.5+0xf0580)
    #1 0x780c82 in xmalloc_size_align lib/util.c:229
    #2 0x89fcb4 in xpacket_pool_init lib/xdpsock.c:156
    #3 0x7eb2da in xsk_configure_umem lib/netdev-afxdp.c:107
    #4 0x7eb2da in xsk_configure lib/netdev-afxdp.c:222
    #5 0x7eb2da in xsk_configure_all lib/netdev-afxdp.c:260
    #6 0x7eb2da in netdev_afxdp_reconfigure lib/netdev-afxdp.c:449
    #7 0x559835 in port_reconfigure lib/dpif-netdev.c:4330
    #8 0x559835 in reconfigure_datapath lib/dpif-netdev.c:4838
    #9 0x55b1cc in do_add_port lib/dpif-netdev.c:1842
    #10 0x55b683 in dpif_netdev_port_add lib/dpif-netdev.c:1868
    #11 0x5746c2 in dpif_port_add lib/dpif.c:577
    #12 0x4743b6 in port_add ofproto/ofproto-dpif.c:3713
    #13 0x44eac5 in ofproto_port_add ofproto/ofproto.c:2013
    #14 0x41536d in iface_do_create vswitchd/bridge.c:1811
    #15 0x41536d in iface_create vswitchd/bridge.c:1849
    #16 0x41536d in bridge_add_ports__ vswitchd/bridge.c:937
    #17 0x41c1a4 in bridge_add_ports vswitchd/bridge.c:953
    #18 0x41c1a4 in bridge_reconfigure vswitchd/bridge.c:667
    #19 0x4274ee in bridge_run vswitchd/bridge.c:3044
    #20 0x409a0c in main vswitchd/ovs-vswitchd.c:127
    #21 0x7f55b44a7812 in __libc_start_main (/lib64/libc.so.6+0x23812)

SUMMARY: AddressSanitizer: heap-use-after-free lib/dp-packet.h:191 in dp_packet_delete
Shadow bytes around the buggy address:
  0x0feb3561e390: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0feb3561e3a0: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0feb3561e3b0: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0feb3561e3c0: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0feb3561e3d0: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
=>0x0feb3561e3e0: fd fd fd fd fd fd[fd]fd fd fd fd fd fd fd fd fd
  0x0feb3561e3f0: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0feb3561e400: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0feb3561e410: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0feb3561e420: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0feb3561e430: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
==17056==ABORTING

As you can see above, 'ipf_destroy' tries to free a delayed dp-packet after
the xpacket_pool has already been destroyed.


Best regards, Ilya Maximets.
William Tu June 22, 2019, 6:18 a.m. UTC | #3
Hi Ilya,

Thanks for such a detailed review!

I wasn't aiming to make all of the "make check-afxdp" test cases
pass, because some of the errors are not related to XDP.
But since you've done a lot of investigation, let's fix them all and make it pass.

> Hi!
> I finally managed to successfully run 'make check-afxdp' with the
> following results:
>
>   ERROR: 76 tests were run,
>   6 failed unexpectedly.
>   48 tests were skipped.
>
> Failed tests are IP fragmentation expiry for conntrack (I'll send a
> separate e-mail about this issue) and NSH tests which are broken for
> a while now (not related to XDP).
>
> However, here is the list of issues I faced and had to fix or work around
> to make it work:
>
> 1. Abort while trying to push any header:
>
>   Thread 10 "pmd8" received signal SIGABRT, Aborted.
>   [Switching to Thread 0x7ff3ab9d9700 (LWP 12151)]
>   0x00007ff3b311793f in raise () from /lib64/libc.so.6
>   (gdb) bt
>   #0  0x00007ff3b311793f in raise () from /lib64/libc.so.6
>   #1  0x00007ff3b3101c95 in abort () from /lib64/libc.so.6
>   #2  0x0000000000683930 in dp_packet_resize__ (b=0x7ff3a9d346b0, new_headroom=64, new_tailroom=<out>) at ./lib/dp-packet.h:613
>   #3  0x000000000068547c in dp_packet_prealloc_headroom (b=<out>, size=4) at lib/dp-packet.c:315
>   #4  dp_packet_push_uninit (b=<out>, size=4) at lib/dp-packet.c:427
>   #5  dp_packet_resize_l2_5 (b=0x7ff3a9d346b0, increment=4) at lib/dp-packet.c:493
>   #6  0x000000000089a841 in push_mpls (packet=0x2, ethtype=<>, lse=1076953088) at lib/packets.c:391
>   #7  0x0000000000762b5d in odp_execute_actions  at lib/odp-execute.c:875
>   #8  0x00000000006a4493 in dp_netdev_execute_actions  at lib/dpif-netdev.c:7264
>   #9  handle_packet_upcall  at lib/dpif-netdev.c:6545
>   #10 fast_path_processing  at lib/dpif-netdev.c:6641
>   #11 0x00000000006a2b6f in dp_netdev_input__  at lib/dpif-netdev.c:6729
>   #12 0x000000000069f973 in dp_netdev_input  at lib/dpif-netdev.c:6767
>   #13 dp_netdev_process_rxq_port  at lib/dpif-netdev.c:4277
>   #14 0x000000000069ba0b in pmd_thread_main  at lib/dpif-netdev.c:5451
>   #15 0x000000000085f6b0 in ovsthread_wrapper  at lib/ovs-thread.c:352
>   #16 0x00007ff3b3e532de in start_thread () from /lib64/libpthread.so.0
>   #17 0x00007ff3b31dca63 in clone () from /lib64/libc.so.6
>
> Reason: Wrong headroom management. In the current implementation the headroom
> of the dp-packet is always zero, so any attempt to resize the packet ends up
> in OVS_NOT_REACHED().
>
> Here is the patch that could solve the issue:
>
> diff --git a/lib/dp-packet.c b/lib/dp-packet.c
> index e6a794707..c9593515a 100644
> --- a/lib/dp-packet.c
> +++ b/lib/dp-packet.c
> @@ -65,10 +65,11 @@ dp_packet_use(struct dp_packet *b, void *base, size_t allocated)
>   * memory starting at AF_XDP umem base.
>   */
>  void
> -dp_packet_use_afxdp(struct dp_packet *b, void *base, size_t allocated)
> +dp_packet_use_afxdp(struct dp_packet *b, void *data,
> +                    size_t allocated, size_t headroom)
>  {
> -    dp_packet_set_base(b, base);
> -    dp_packet_set_data(b, base);
> +    dp_packet_set_base(b, (char *) data - headroom);
> +    dp_packet_set_data(b, data);
>      dp_packet_set_size(b, 0);
>
>      dp_packet_set_allocated(b, allocated);
> diff --git a/lib/dp-packet.h b/lib/dp-packet.h
> index e3438226e..47ea14b94 100644
> --- a/lib/dp-packet.h
> +++ b/lib/dp-packet.h
> @@ -132,7 +132,7 @@ void dp_packet_use(struct dp_packet *, void *, size_t);
>  void dp_packet_use_stub(struct dp_packet *, void *, size_t);
>  void dp_packet_use_const(struct dp_packet *, const void *, size_t);
>  #if HAVE_AF_XDP
> -void dp_packet_use_afxdp(struct dp_packet *, void *, size_t);
> +void dp_packet_use_afxdp(struct dp_packet *, void *, size_t, size_t);
>  #endif
>  void dp_packet_init_dpdk(struct dp_packet *);
>
> diff --git a/lib/netdev-afxdp.c b/lib/netdev-afxdp.c
> index 33d861215..518389a58 100644
> --- a/lib/netdev-afxdp.c
> +++ b/lib/netdev-afxdp.c
> @@ -591,7 +591,7 @@ netdev_afxdp_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
>          packet = &xpacket->packet;
>
>          /* Initialize the struct dp_packet */
> -        dp_packet_use_afxdp(packet, pkt, FRAME_SIZE - FRAME_HEADROOM);
> +        dp_packet_use_afxdp(packet, pkt, FRAME_SIZE, FRAME_HEADROOM);

I have to double check.
The FRAME_HEADROOM is actually the XDP_PACKET_HEADROOM, which is
reserved for the driver to store metadata. I'm afraid we would overwrite
that metadata.
Maybe we should reserve our own headroom through the umem API, by setting
frame_headroom in the struct below (from xsk.h):
struct xsk_umem_config {
    __u32 fill_size;
    __u32 comp_size;
    __u32 frame_size;
    __u32 frame_headroom;
};
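
As a rough sketch of what I mean, assuming we fill the config ourselves:
the struct below is only a local mirror of the one above so that the
snippet builds without libbpf, and the SKETCH_* values and the
make_umem_config() helper are hypothetical, not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Local mirror of the struct quoted above, so the sketch compiles without
 * libbpf.  Real code would include libbpf's xsk.h instead. */
struct umem_config_sketch {
    uint32_t fill_size;
    uint32_t comp_size;
    uint32_t frame_size;
    uint32_t frame_headroom;
};

/* Hypothetical values, chosen for illustration only. */
#define SKETCH_NUM_DESCS    2048u
#define SKETCH_FRAME_SIZE   2048u
#define SKETCH_OVS_HEADROOM 128u

/* Builds a config that asks for extra per-frame headroom owned by OVS,
 * separate from the headroom the driver uses for its own metadata. */
static struct umem_config_sketch
make_umem_config(void)
{
    struct umem_config_sketch cfg = {
        .fill_size = SKETCH_NUM_DESCS,
        .comp_size = SKETCH_NUM_DESCS,
        .frame_size = SKETCH_FRAME_SIZE,
        .frame_headroom = SKETCH_OVS_HEADROOM,
    };

    /* The reserved headroom has to leave room for packet data. */
    assert(cfg.frame_headroom < cfg.frame_size);
    return cfg;
}
```

The idea would be that header pushes only ever use the bytes reserved
through frame_headroom and never touch the driver's area.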

>          dp_packet_set_size(packet, len);
>
>          /* Add packet into batch, increase batch->count */
> @@ -646,7 +646,7 @@ free_afxdp_buf(struct dp_packet *p)
>      if (xpacket->mpool) {
>          void *base = dp_packet_base(p);
>
> -        addr = (uintptr_t)base & (~FRAME_SHIFT_MASK);
> +        addr = ((uintptr_t)base + FRAME_HEADROOM) & (~FRAME_SHIFT_MASK);
>          umem_elem_push(xpacket->mpool, (void *)addr);
>      }
>  }
> @@ -664,7 +664,7 @@ free_afxdp_buf_batch(struct dp_packet_batch *batch)
>          if (xpacket->mpool) {
>              void *base = dp_packet_base(packet);
>
> -            addr = (uintptr_t)base & (~FRAME_SHIFT_MASK);
> +            addr = ((uintptr_t)base + FRAME_HEADROOM) & (~FRAME_SHIFT_MASK);
>              elems[i] = (void *)addr;
>          }
>      }
> ---
>
> Works for me. Please, re-check. I might miss something.

Thanks, I will do it.

>
>
>
> 2. vlan tests fail due to kernel issues. We need to disable vlan offloading
> along with tx offloading, otherwise packets disappear somewhere inside the
> kernel. I didn't find the root cause. In v8 of this patch you disabled 'rxvlan'
> and 'txvlan' for tests, but these bits are missing in newer versions.
> Most probably we only need to disable 'txvlan'.
>
I will test it, thanks.
>
>
> 3. cvlan tests fail due to kernel issues. I didn't find a workaround for this.
> Disabling vlan offloading doesn't help. I had to force the afxdp testsuite to
> skip these tests:
>
> diff --git a/tests/system-afxdp-macros.at b/tests/system-afxdp-macros.at
> index 1e6f7a46b..91f2ef91c 100644
> --- a/tests/system-afxdp-macros.at
> +++ b/tests/system-afxdp-macros.at
> @@ -18,3 +18,6 @@ m4_define([ADD_VETH],
>        on_exit 'ip link del ovs-$1'
>      ]
>  )
> +
> +m4_define([OVS_CHECK_8021AD],
> +    [AT_SKIP_IF([:])])
> diff --git a/tests/system-traffic.at b/tests/system-traffic.at
> index d23ee897b..22f814fba 100644
> --- a/tests/system-traffic.at
> +++ b/tests/system-traffic.at
> @@ -71,6 +71,7 @@ AT_CLEANUP
>
>  AT_SETUP([datapath - ping between two ports on cvlan])
>  OVS_TRAFFIC_VSWITCHD_START()
> +OVS_CHECK_8021AD()
>
>  AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
>
> @@ -161,6 +162,7 @@ AT_CLEANUP
>
>  AT_SETUP([datapath - ping6 between two ports on cvlan])
>  OVS_TRAFFIC_VSWITCHD_START()
> +OVS_CHECK_8021AD()
>
>  AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
>
> ---
>
>
>
> 4. 'system-afxdp-testsuite' doesn't re-build on 'system-userspace-macros.at'
> changes. Missing file dependencies in automake:
>
> diff --git a/tests/automake.mk b/tests/automake.mk
> index 131564bb0..f0449c395 100644
> --- a/tests/automake.mk
> +++ b/tests/automake.mk
> @@ -163,6 +163,7 @@ SYSTEM_USERSPACE_TESTSUITE_AT = \
>         tests/system-userspace-packet-type-aware.at
>
>  SYSTEM_AFXDP_TESTSUITE_AT = \
> +       tests/system-userspace-macros.at \
>         tests/system-afxdp-testsuite.at \
>         tests/system-afxdp-macros.at
>
Thanks!

> ---
>
>
> 5. 'make check-afxdp' executes 'make install' which is unwanted:
>
> diff --git a/tests/automake.mk b/tests/automake.mk
> index 131564bb0..d6ab51732 100644
> --- a/tests/automake.mk
> +++ b/tests/automake.mk
> @@ -325,7 +326,6 @@ check-system-userspace: all
>         "$$@" $(TESTSUITEFLAGS) -j1 || (test X'$(RECHECK)' = Xyes && "$$@" --recheck)
>
>  check-afxdp: all
> -       $(MAKE) install
>         set $(SHELL) '$(SYSTEM_AFXDP_TESTSUITE)' -C tests  AUTOTEST_PATH='$(AUTOTEST_PATH)' $(TESTSUITEFLAGS) -j1; \
>         "$$@" || (test X'$(RECHECK)' = Xyes && "$$@" --recheck)
>
Thanks!

> ---
>
>
> 6. TCP doesn't work with XDP over veth interfaces. This is a known kernel
> issue that can be worked around with the following kernel patch:
>     https://github.com/cilium/cilium/issues/3077#issuecomment-430801467 .
> Until a proper solution is implemented in the kernel, we should probably skip
> all the tests that involve TCP.

OK, I will do it in next version.

>
>
> 7. It's a known issue that tunneling is not working right now in system-traffic
> userspace tests. It can be worked around by removing '--disable-system' from
> OVS_TRAFFIC_VSWITCHD_START in tests/system-userspace-macros.at. I'm going to
> prepare a patch for this issue in the near future.
>

Right, we also tried to fix this issue before. But after removing
'--disable-system', we hit another issue related to the revalidator.
I will provide more details next week.

>
> 8. As I said, I'll send a separate mail about the IP fragmentation conntrack issues.
>
>
> P.S. We probably should mention TCP, vlan and 8021ad issues on veth interfaces
>      somewhere in docs, so users will be aware of them.
>
> Best regards, Ilya Maximets.

By the way, do you know of any CI system, such as Travis, where we can run
'make check-system-userspace' or 'make check-afxdp'?

Regards,
William
Ilya Maximets June 24, 2019, 5:23 p.m. UTC | #4
On 22.06.2019 9:18, William Tu wrote:
> Hi Ilya,
> 
> Thanks for such a detailed review!
> 
> I wasn't planning to make all the "make check-afxdp" test cases
> pass, because some of the errors are not related to XDP.
> But since you've done so much investigation, let's fix them all and make it pass.
> 
>> Hi!
>> I finally managed to successfully run 'make check-afxdp' with the
>> following results:
>>
>>   ERROR: 76 tests were run,
>>   6 failed unexpectedly.
>>   48 tests were skipped.
>>
>> Failed tests are IP fragmentation expiry for conntrack (I'll send a
>> separate e-mail about this issue) and NSH tests which are broken for
>> a while now (not related to XDP).
>>
>> However, here is the list of issues I faced and had to fix/workaround
>> to make it work:
>>
>> 1. Abort while trying to push any header:
>>
>>   Thread 10 "pmd8" received signal SIGABRT, Aborted.
>>   [Switching to Thread 0x7ff3ab9d9700 (LWP 12151)]
>>   0x00007ff3b311793f in raise () from /lib64/libc.so.6
>>   (gdb) bt
>>   #0  0x00007ff3b311793f in raise () from /lib64/libc.so.6
>>   #1  0x00007ff3b3101c95 in abort () from /lib64/libc.so.6
>>   #2  0x0000000000683930 in dp_packet_resize__ (b=0x7ff3a9d346b0, new_headroom=64, new_tailroom=<out>) at ./lib/dp-packet.h:613
>>   #3  0x000000000068547c in dp_packet_prealloc_headroom (b=<out>, size=4) at lib/dp-packet.c:315
>>   #4  dp_packet_push_uninit (b=<out>, size=4) at lib/dp-packet.c:427
>>   #5  dp_packet_resize_l2_5 (b=0x7ff3a9d346b0, increment=4) at lib/dp-packet.c:493
>>   #6  0x000000000089a841 in push_mpls (packet=0x2, ethtype=<>, lse=1076953088) at lib/packets.c:391
>>   #7  0x0000000000762b5d in odp_execute_actions  at lib/odp-execute.c:875
>>   #8  0x00000000006a4493 in dp_netdev_execute_actions  at lib/dpif-netdev.c:7264
>>   #9  handle_packet_upcall  at lib/dpif-netdev.c:6545
>>   #10 fast_path_processing  at lib/dpif-netdev.c:6641
>>   #11 0x00000000006a2b6f in dp_netdev_input__  at lib/dpif-netdev.c:6729
>>   #12 0x000000000069f973 in dp_netdev_input  at lib/dpif-netdev.c:6767
>>   #13 dp_netdev_process_rxq_port  at lib/dpif-netdev.c:4277
>>   #14 0x000000000069ba0b in pmd_thread_main  at lib/dpif-netdev.c:5451
>>   #15 0x000000000085f6b0 in ovsthread_wrapper  at lib/ovs-thread.c:352
>>   #16 0x00007ff3b3e532de in start_thread () from /lib64/libpthread.so.0
>>   #17 0x00007ff3b31dca63 in clone () from /lib64/libc.so.6
>>
>> Reason: Wrong headroom management. In the current implementation the headroom
>> of the dp-packet is always zero, so any attempt to resize it ends in
>> OVS_NOT_REACHED().
>>
>> Here is the patch that could solve the issue:
>>
>> diff --git a/lib/dp-packet.c b/lib/dp-packet.c
>> index e6a794707..c9593515a 100644
>> --- a/lib/dp-packet.c
>> +++ b/lib/dp-packet.c
>> @@ -65,10 +65,11 @@ dp_packet_use(struct dp_packet *b, void *base, size_t allocated)
>>   * memory starting at AF_XDP umem base.
>>   */
>>  void
>> -dp_packet_use_afxdp(struct dp_packet *b, void *base, size_t allocated)
>> +dp_packet_use_afxdp(struct dp_packet *b, void *data,
>> +                    size_t allocated, size_t headroom)
>>  {
>> -    dp_packet_set_base(b, base);
>> -    dp_packet_set_data(b, base);
>> +    dp_packet_set_base(b, (char *) data - headroom);
>> +    dp_packet_set_data(b, data);
>>      dp_packet_set_size(b, 0);
>>
>>      dp_packet_set_allocated(b, allocated);
>> diff --git a/lib/dp-packet.h b/lib/dp-packet.h
>> index e3438226e..47ea14b94 100644
>> --- a/lib/dp-packet.h
>> +++ b/lib/dp-packet.h
>> @@ -132,7 +132,7 @@ void dp_packet_use(struct dp_packet *, void *, size_t);
>>  void dp_packet_use_stub(struct dp_packet *, void *, size_t);
>>  void dp_packet_use_const(struct dp_packet *, const void *, size_t);
>>  #if HAVE_AF_XDP
>> -void dp_packet_use_afxdp(struct dp_packet *, void *, size_t);
>> +void dp_packet_use_afxdp(struct dp_packet *, void *, size_t, size_t);
>>  #endif
>>  void dp_packet_init_dpdk(struct dp_packet *);
>>
>> diff --git a/lib/netdev-afxdp.c b/lib/netdev-afxdp.c
>> index 33d861215..518389a58 100644
>> --- a/lib/netdev-afxdp.c
>> +++ b/lib/netdev-afxdp.c
>> @@ -591,7 +591,7 @@ netdev_afxdp_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
>>          packet = &xpacket->packet;
>>
>>          /* Initialize the struct dp_packet */
>> -        dp_packet_use_afxdp(packet, pkt, FRAME_SIZE - FRAME_HEADROOM);
>> +        dp_packet_use_afxdp(packet, pkt, FRAME_SIZE, FRAME_HEADROOM);
> 
> I have to double check.
> The FRAME_HEADROOM is actually the XDP_PACKET_HEADROOM, which is
> reserved for driver to put metadata. I'm afraid we will over-write
> something above.
> Maybe we should reserve our own headroom by calling umem api, setting the
> frame_headroom below in xsk.h
> struct xsk_umem_config {
>     __u32 fill_size;
>     __u32 comp_size;
>     __u32 frame_size;
>     __u32 frame_headroom;
> };

Yes, you're right: we are only guaranteed to have 'umem->headroom' bytes before the
'addr' read from the rx ring. So we should set 'frame_headroom' to some value
(at least 128) and perform the same calculations as I did, but with 'umem->headroom'
instead of FRAME_HEADROOM. Like:
    dp_packet_use_afxdp(packet, pkt, FRAME_SIZE, umem->headroom);

It looks like the 2 hunks below are not needed in this case.
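
One way to read this, sketched under the assumption that umem frames are
power-of-two sized and aligned (which is what the mask in the two
free_afxdp_buf* functions suggests): as long as the base pointer stays
inside its own frame, masking alone recovers the frame start, with no
extra offset. The SKETCH_* constants and frame_start() are hypothetical
names, not the ones from netdev-afxdp.c.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical frame geometry: 2048-byte frames, 2048-byte aligned. */
#define SKETCH_FRAME_SHIFT      11
#define SKETCH_FRAME_SIZE       ((uintptr_t) 1 << SKETCH_FRAME_SHIFT)
#define SKETCH_FRAME_SHIFT_MASK (SKETCH_FRAME_SIZE - 1)

/* Returns the start of the frame that contains 'addr_in_frame' by clearing
 * the low SKETCH_FRAME_SHIFT bits of the address. */
static uintptr_t
frame_start(uintptr_t addr_in_frame)
{
    return addr_in_frame & ~SKETCH_FRAME_SHIFT_MASK;
}
```

Note that the same masking resolves to the previous frame if the base
pointer ever ends up before the start of its frame, so the base
calculation has to keep it inside.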

> 
>>          dp_packet_set_size(packet, len);
>>
>>          /* Add packet into batch, increase batch->count */
>> @@ -646,7 +646,7 @@ free_afxdp_buf(struct dp_packet *p)
>>      if (xpacket->mpool) {
>>          void *base = dp_packet_base(p);
>>
>> -        addr = (uintptr_t)base & (~FRAME_SHIFT_MASK);
>> +        addr = ((uintptr_t)base + FRAME_HEADROOM) & (~FRAME_SHIFT_MASK);
>>          umem_elem_push(xpacket->mpool, (void *)addr);
>>      }
>>  }
>> @@ -664,7 +664,7 @@ free_afxdp_buf_batch(struct dp_packet_batch *batch)
>>          if (xpacket->mpool) {
>>              void *base = dp_packet_base(packet);
>>
>> -            addr = (uintptr_t)base & (~FRAME_SHIFT_MASK);
>> +            addr = ((uintptr_t)base + FRAME_HEADROOM) & (~FRAME_SHIFT_MASK);
>>              elems[i] = (void *)addr;
>>          }
>>      }
>> ---
>>
>> Works for me. Please, re-check. I might miss something.
> 
> Thanks, I will do it.
> 
>>
>>
>>
>> 2. vlan tests fail due to kernel issues. We need to disable vlan offloading
>> along with tx offloading, otherwise packets disappear somewhere inside the
>> kernel. I didn't find the root cause. In v8 of this patch you disabled 'rxvlan'
>> and 'txvlan' for tests, but these bits are missing in newer versions.
>> Most probably we only need to disable 'txvlan'.
>>
> I will test it, thanks.
>>
>>
>> 3. cvlan tests fail due to kernel issues. I didn't find a workaround for this.
>> Disabling vlan offloading doesn't help. I had to force the afxdp testsuite to
>> skip these tests:
>>
>> diff --git a/tests/system-afxdp-macros.at b/tests/system-afxdp-macros.at
>> index 1e6f7a46b..91f2ef91c 100644
>> --- a/tests/system-afxdp-macros.at
>> +++ b/tests/system-afxdp-macros.at
>> @@ -18,3 +18,6 @@ m4_define([ADD_VETH],
>>        on_exit 'ip link del ovs-$1'
>>      ]
>>  )
>> +
>> +m4_define([OVS_CHECK_8021AD],
>> +    [AT_SKIP_IF([:])])
>> diff --git a/tests/system-traffic.at b/tests/system-traffic.at
>> index d23ee897b..22f814fba 100644
>> --- a/tests/system-traffic.at
>> +++ b/tests/system-traffic.at
>> @@ -71,6 +71,7 @@ AT_CLEANUP
>>
>>  AT_SETUP([datapath - ping between two ports on cvlan])
>>  OVS_TRAFFIC_VSWITCHD_START()
>> +OVS_CHECK_8021AD()
>>
>>  AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
>>
>> @@ -161,6 +162,7 @@ AT_CLEANUP
>>
>>  AT_SETUP([datapath - ping6 between two ports on cvlan])
>>  OVS_TRAFFIC_VSWITCHD_START()
>> +OVS_CHECK_8021AD()
>>
>>  AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
>>
>> ---
>>
>>
>>
>> 4. 'system-afxdp-testsuite' doesn't re-build on 'system-userspace-macros.at'
>> changes. Missing file dependencies in automake:
>>
>> diff --git a/tests/automake.mk b/tests/automake.mk
>> index 131564bb0..f0449c395 100644
>> --- a/tests/automake.mk
>> +++ b/tests/automake.mk
>> @@ -163,6 +163,7 @@ SYSTEM_USERSPACE_TESTSUITE_AT = \
>>         tests/system-userspace-packet-type-aware.at
>>
>>  SYSTEM_AFXDP_TESTSUITE_AT = \
>> +       tests/system-userspace-macros.at \
>>         tests/system-afxdp-testsuite.at \
>>         tests/system-afxdp-macros.at
>>
> Thanks!
> 
>> ---
>>
>>
>> 5. 'make check-afxdp' executes 'make install' which is unwanted:
>>
>> diff --git a/tests/automake.mk b/tests/automake.mk
>> index 131564bb0..d6ab51732 100644
>> --- a/tests/automake.mk
>> +++ b/tests/automake.mk
>> @@ -325,7 +326,6 @@ check-system-userspace: all
>>         "$$@" $(TESTSUITEFLAGS) -j1 || (test X'$(RECHECK)' = Xyes && "$$@" --recheck)
>>
>>  check-afxdp: all
>> -       $(MAKE) install
>>         set $(SHELL) '$(SYSTEM_AFXDP_TESTSUITE)' -C tests  AUTOTEST_PATH='$(AUTOTEST_PATH)' $(TESTSUITEFLAGS) -j1; \
>>         "$$@" || (test X'$(RECHECK)' = Xyes && "$$@" --recheck)
>>
> Thanks!
> 
>> ---
>>
>>
>> 6. TCP doesn't work with XDP over veth interfaces. This is a known kernel
>> issue that can be worked around with the following kernel patch:
>>     https://github.com/cilium/cilium/issues/3077#issuecomment-430801467 .
>> Until a proper solution is implemented in the kernel, we should probably skip
>> all the tests that involve TCP.
> 
> OK, I will do it in next version.
> 
>>
>>
>> 7. It's a known issue that tunneling is not working right now in system-traffic
>> userspace tests. It can be worked around by removing '--disable-system' from
>> OVS_TRAFFIC_VSWITCHD_START in tests/system-userspace-macros.at. I'm going to
>> prepare a patch for this issue in the near future.
>>
> 
> Right, we also tried to fix this issue before. But after removing
> '--disable-system', we hit another issue related to the revalidator.
> I will provide more details next week.
> 
>>
>> 8. As I said, I'll send a separate mail about the IP fragmentation conntrack issues.
>>
>>
>> P.S. We probably should mention TCP, vlan and 8021ad issues on veth interfaces
>>      somewhere in docs, so users will be aware of them.
>>
>> Best regards, Ilya Maximets.
> 
> By the way, do you know of any CI system, such as Travis, where we can run
> 'make check-system-userspace' or 'make check-afxdp'?

Since Travis migrated to VM-based workloads, we could try to run 'make check-system-userspace'
there. Regarding 'make check-afxdp', I don't know of a public CI with a recent enough
kernel, or one where we could use our own custom kernel.

> 
> Regards,
> William
> 
>
Ilya Maximets June 24, 2019, 5:26 p.m. UTC | #5
On 24.06.2019 20:23, Ilya Maximets wrote:
> On 22.06.2019 9:18, William Tu wrote:
>> Hi Ilya,
>>
>> Thanks for such a detailed review!
>>
>> I wasn't planning to make all the "make check-afxdp" test cases
>> pass, because some of the errors are not related to XDP.
>> But since you've done so much investigation, let's fix them all and make it pass.
>>
>>> Hi!
>>> I finally managed to successfully run 'make check-afxdp' with the
>>> following results:
>>>
>>>   ERROR: 76 tests were run,
>>>   6 failed unexpectedly.
>>>   48 tests were skipped.
>>>
>>> Failed tests are IP fragmentation expiry for conntrack (I'll send a
>>> separate e-mail about this issue) and NSH tests which are broken for
>>> a while now (not related to XDP).
>>>
>>> However, here is the list of issues I faced and had to fix/workaround
>>> to make it work:
>>>
>>> 1. Abort while trying to push any header:
>>>
>>>   Thread 10 "pmd8" received signal SIGABRT, Aborted.
>>>   [Switching to Thread 0x7ff3ab9d9700 (LWP 12151)]
>>>   0x00007ff3b311793f in raise () from /lib64/libc.so.6
>>>   (gdb) bt
>>>   #0  0x00007ff3b311793f in raise () from /lib64/libc.so.6
>>>   #1  0x00007ff3b3101c95 in abort () from /lib64/libc.so.6
>>>   #2  0x0000000000683930 in dp_packet_resize__ (b=0x7ff3a9d346b0, new_headroom=64, new_tailroom=<out>) at ./lib/dp-packet.h:613
>>>   #3  0x000000000068547c in dp_packet_prealloc_headroom (b=<out>, size=4) at lib/dp-packet.c:315
>>>   #4  dp_packet_push_uninit (b=<out>, size=4) at lib/dp-packet.c:427
>>>   #5  dp_packet_resize_l2_5 (b=0x7ff3a9d346b0, increment=4) at lib/dp-packet.c:493
>>>   #6  0x000000000089a841 in push_mpls (packet=0x2, ethtype=<>, lse=1076953088) at lib/packets.c:391
>>>   #7  0x0000000000762b5d in odp_execute_actions  at lib/odp-execute.c:875
>>>   #8  0x00000000006a4493 in dp_netdev_execute_actions  at lib/dpif-netdev.c:7264
>>>   #9  handle_packet_upcall  at lib/dpif-netdev.c:6545
>>>   #10 fast_path_processing  at lib/dpif-netdev.c:6641
>>>   #11 0x00000000006a2b6f in dp_netdev_input__  at lib/dpif-netdev.c:6729
>>>   #12 0x000000000069f973 in dp_netdev_input  at lib/dpif-netdev.c:6767
>>>   #13 dp_netdev_process_rxq_port  at lib/dpif-netdev.c:4277
>>>   #14 0x000000000069ba0b in pmd_thread_main  at lib/dpif-netdev.c:5451
>>>   #15 0x000000000085f6b0 in ovsthread_wrapper  at lib/ovs-thread.c:352
>>>   #16 0x00007ff3b3e532de in start_thread () from /lib64/libpthread.so.0
>>>   #17 0x00007ff3b31dca63 in clone () from /lib64/libc.so.6
>>>
>>> Reason: Wrong headroom management. In the current implementation the headroom
>>> of the dp-packet is always zero, so any attempt to resize it ends in
>>> OVS_NOT_REACHED().
>>>
>>> Here is the patch that could solve the issue:
>>>
>>> diff --git a/lib/dp-packet.c b/lib/dp-packet.c
>>> index e6a794707..c9593515a 100644
>>> --- a/lib/dp-packet.c
>>> +++ b/lib/dp-packet.c
>>> @@ -65,10 +65,11 @@ dp_packet_use(struct dp_packet *b, void *base, size_t allocated)
>>>   * memory starting at AF_XDP umem base.
>>>   */
>>>  void
>>> -dp_packet_use_afxdp(struct dp_packet *b, void *base, size_t allocated)
>>> +dp_packet_use_afxdp(struct dp_packet *b, void *data,
>>> +                    size_t allocated, size_t headroom)
>>>  {
>>> -    dp_packet_set_base(b, base);
>>> -    dp_packet_set_data(b, base);
>>> +    dp_packet_set_base(b, (char *) data - headroom);
>>> +    dp_packet_set_data(b, data);
>>>      dp_packet_set_size(b, 0);
>>>
>>>      dp_packet_set_allocated(b, allocated);
>>> diff --git a/lib/dp-packet.h b/lib/dp-packet.h
>>> index e3438226e..47ea14b94 100644
>>> --- a/lib/dp-packet.h
>>> +++ b/lib/dp-packet.h
>>> @@ -132,7 +132,7 @@ void dp_packet_use(struct dp_packet *, void *, size_t);
>>>  void dp_packet_use_stub(struct dp_packet *, void *, size_t);
>>>  void dp_packet_use_const(struct dp_packet *, const void *, size_t);
>>>  #if HAVE_AF_XDP
>>> -void dp_packet_use_afxdp(struct dp_packet *, void *, size_t);
>>> +void dp_packet_use_afxdp(struct dp_packet *, void *, size_t, size_t);
>>>  #endif
>>>  void dp_packet_init_dpdk(struct dp_packet *);
>>>
>>> diff --git a/lib/netdev-afxdp.c b/lib/netdev-afxdp.c
>>> index 33d861215..518389a58 100644
>>> --- a/lib/netdev-afxdp.c
>>> +++ b/lib/netdev-afxdp.c
>>> @@ -591,7 +591,7 @@ netdev_afxdp_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
>>>          packet = &xpacket->packet;
>>>
>>>          /* Initialize the struct dp_packet */
>>> -        dp_packet_use_afxdp(packet, pkt, FRAME_SIZE - FRAME_HEADROOM);
>>> +        dp_packet_use_afxdp(packet, pkt, FRAME_SIZE, FRAME_HEADROOM);
>>
>> I have to double check.
>> The FRAME_HEADROOM is actually the XDP_PACKET_HEADROOM, which is
>> reserved for driver to put metadata. I'm afraid we will over-write
>> something above.
>> Maybe we should reserve our own headroom by calling umem api, setting the
>> frame_headroom below in xsk.h
>> struct xsk_umem_config {
>>     __u32 fill_size;
>>     __u32 comp_size;
>>     __u32 frame_size;
>>     __u32 frame_headroom;
>> };
> 
> Yes, you're right: we are only guaranteed to have 'umem->headroom' bytes before the
> 'addr' read from the rx ring. So we should set 'frame_headroom' to some value
> (at least 128) and perform the same calculations as I did, but with 'umem->headroom'
> instead of FRAME_HEADROOM. Like:
>     dp_packet_use_afxdp(packet, pkt, FRAME_SIZE, umem->headroom);

      dp_packet_use_afxdp(packet, pkt,
                          FRAME_SIZE - umem->headroom - FRAME_HEADROOM,
                          umem->headroom);

> 
> It looks like the 2 hunks below are not needed in this case.
> 
>>
>>>          dp_packet_set_size(packet, len);
>>>
>>>          /* Add packet into batch, increase batch->count */
>>> @@ -646,7 +646,7 @@ free_afxdp_buf(struct dp_packet *p)
>>>      if (xpacket->mpool) {
>>>          void *base = dp_packet_base(p);
>>>
>>> -        addr = (uintptr_t)base & (~FRAME_SHIFT_MASK);
>>> +        addr = ((uintptr_t)base + FRAME_HEADROOM) & (~FRAME_SHIFT_MASK);
>>>          umem_elem_push(xpacket->mpool, (void *)addr);
>>>      }
>>>  }
>>> @@ -664,7 +664,7 @@ free_afxdp_buf_batch(struct dp_packet_batch *batch)
>>>          if (xpacket->mpool) {
>>>              void *base = dp_packet_base(packet);
>>>
>>> -            addr = (uintptr_t)base & (~FRAME_SHIFT_MASK);
>>> +            addr = ((uintptr_t)base + FRAME_HEADROOM) & (~FRAME_SHIFT_MASK);
>>>              elems[i] = (void *)addr;
>>>          }
>>>      }
>>> ---
>>>
>>> Works for me. Please, re-check. I might miss something.
>>
>> Thanks, I will do it.
>>
>>>
>>>
>>>
>>> 2. vlan tests fail due to kernel issues. We need to disable vlan offloading
>>> along with tx offloading, otherwise packets disappear somewhere inside the
>>> kernel. I didn't find the root cause. In v8 of this patch you disabled 'rxvlan'
>>> and 'txvlan' for tests, but these bits are missing in newer versions.
>>> Most probably we only need to disable 'txvlan'.
>>>
>> I will test it, thanks.
>>>
>>>
>>> 3. cvlan tests fail due to kernel issues. I didn't find a workaround for this.
>>> Disabling vlan offloading doesn't help. I had to force the afxdp testsuite to
>>> skip these tests:
>>>
>>> diff --git a/tests/system-afxdp-macros.at b/tests/system-afxdp-macros.at
>>> index 1e6f7a46b..91f2ef91c 100644
>>> --- a/tests/system-afxdp-macros.at
>>> +++ b/tests/system-afxdp-macros.at
>>> @@ -18,3 +18,6 @@ m4_define([ADD_VETH],
>>>        on_exit 'ip link del ovs-$1'
>>>      ]
>>>  )
>>> +
>>> +m4_define([OVS_CHECK_8021AD],
>>> +    [AT_SKIP_IF([:])])
>>> diff --git a/tests/system-traffic.at b/tests/system-traffic.at
>>> index d23ee897b..22f814fba 100644
>>> --- a/tests/system-traffic.at
>>> +++ b/tests/system-traffic.at
>>> @@ -71,6 +71,7 @@ AT_CLEANUP
>>>
>>>  AT_SETUP([datapath - ping between two ports on cvlan])
>>>  OVS_TRAFFIC_VSWITCHD_START()
>>> +OVS_CHECK_8021AD()
>>>
>>>  AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
>>>
>>> @@ -161,6 +162,7 @@ AT_CLEANUP
>>>
>>>  AT_SETUP([datapath - ping6 between two ports on cvlan])
>>>  OVS_TRAFFIC_VSWITCHD_START()
>>> +OVS_CHECK_8021AD()
>>>
>>>  AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
>>>
>>> ---
>>>
>>>
>>>
>>> 4. 'system-afxdp-testsuite' doesn't re-build on 'system-userspace-macros.at'
>>> changes. Missing file dependencies in automake:
>>>
>>> diff --git a/tests/automake.mk b/tests/automake.mk
>>> index 131564bb0..f0449c395 100644
>>> --- a/tests/automake.mk
>>> +++ b/tests/automake.mk
>>> @@ -163,6 +163,7 @@ SYSTEM_USERSPACE_TESTSUITE_AT = \
>>>         tests/system-userspace-packet-type-aware.at
>>>
>>>  SYSTEM_AFXDP_TESTSUITE_AT = \
>>> +       tests/system-userspace-macros.at \
>>>         tests/system-afxdp-testsuite.at \
>>>         tests/system-afxdp-macros.at
>>>
>> Thanks!
>>
>>> ---
>>>
>>>
>>> 5. 'make check-afxdp' executes 'make install' which is unwanted:
>>>
>>> diff --git a/tests/automake.mk b/tests/automake.mk
>>> index 131564bb0..d6ab51732 100644
>>> --- a/tests/automake.mk
>>> +++ b/tests/automake.mk
>>> @@ -325,7 +326,6 @@ check-system-userspace: all
>>>         "$$@" $(TESTSUITEFLAGS) -j1 || (test X'$(RECHECK)' = Xyes && "$$@" --recheck)
>>>
>>>  check-afxdp: all
>>> -       $(MAKE) install
>>>         set $(SHELL) '$(SYSTEM_AFXDP_TESTSUITE)' -C tests  AUTOTEST_PATH='$(AUTOTEST_PATH)' $(TESTSUITEFLAGS) -j1; \
>>>         "$$@" || (test X'$(RECHECK)' = Xyes && "$$@" --recheck)
>>>
>> Thanks!
>>
>>> ---
>>>
>>>
>>> 6. TCP doesn't work with XDP over veth interfaces. This is a known kernel
>>> issue that can be worked around with the following kernel patch:
>>>     https://github.com/cilium/cilium/issues/3077#issuecomment-430801467 .
>>> Until a proper solution is implemented in the kernel, we should probably skip
>>> all the tests that involve TCP.
>>
>> OK, I will do it in next version.
>>
>>>
>>>
>>> 7. It's a known issue that tunneling is not working right now in system-traffic
>>> userspace tests. It can be worked around by removing '--disable-system' from
>>> OVS_TRAFFIC_VSWITCHD_START in tests/system-userspace-macros.at. I'm going to
>>> prepare a patch for this issue in the near future.
>>>
>>
>> Right, we also tried to fix this issue before. But after removing
>> '--disable-system', we hit another issue related to the revalidator.
>> I will provide more details next week.
>>
>>>
>>> 8. As I said, I'll send a separate mail about the IP fragmentation conntrack issues.
>>>
>>>
>>> P.S. We probably should mention TCP, vlan and 8021ad issues on veth interfaces
>>>      somewhere in docs, so users will be aware of them.
>>>
>>> Best regards, Ilya Maximets.
>>
>> By the way, do you know of any CI system, such as Travis, where we can run
>> 'make check-system-userspace' or 'make check-afxdp'?
> 
> Since Travis migrated to VM-based workloads, we could try to run 'make check-system-userspace'
> there. Regarding 'make check-afxdp', I don't know of a public CI with a recent enough
> kernel, or one where we could use our own custom kernel.
> 
>>
>> Regards,
>> William
>>
>>
> 
>
William Tu June 25, 2019, 9:58 p.m. UTC | #6
> 7. It's a known issue that tunneling is not working right now in system-traffic
> userspace tests. It can be worked around by removing '--disable-system' from
> OVS_TRAFFIC_VSWITCHD_START in tests/system-userspace-macros.at. I'm going to
> prepare a patch for this issue in the near future.
>
I sent out a patch for the above issue at
https://patchwork.ozlabs.org/patch/1122321/

Thanks
William
William Tu June 26, 2019, 10:24 p.m. UTC | #7
On Fri, Jun 21, 2019 at 7:56 AM Ilya Maximets <i.maximets@samsung.com> wrote:
>
> On 19.06.2019 22:51, William Tu wrote:
> > The patch introduces experimental AF_XDP support for OVS netdev.
> > AF_XDP, the Address Family of the eXpress Data Path, is a new Linux socket
> > type built upon the eBPF and XDP technology.  It aims to have comparable
> > performance to DPDK but cooperate better with the existing kernel networking
> > stack.  An AF_XDP socket receives and sends packets from an eBPF/XDP program
> > attached to the netdev, bypassing a couple of the Linux kernel's subsystems.
> > As a result, an AF_XDP socket shows much better performance than AF_PACKET.
> > For more details about AF_XDP, please see linux kernel's
> > Documentation/networking/af_xdp.rst. Note that by default, this feature is
> > not compiled in.
> >
> > Signed-off-by: William Tu <u9012063@gmail.com>
> > ---
>
> Hi!
> This is about "conntrack - IP fragmentation expiry" tests I mentioned in a
> previous mail:
>     https://mail.openvswitch.org/pipermail/ovs-dev/2019-June/359971.html
>
> There is a major bug related to memory pool management. The issue is that
> we *must not* free a memory pool while packets from it are still in use by
> any other code. For example, packets could be delayed for future processing,
> as happens with IP fragment re-assembly. We fixed the same issue for
> DPDK around a year ago. In practice, we must postpone the actual freeing of
> umem->buffer, umem_pool and the xpacket_pool until all packets are freed, i.e.
> for as long as umemp->index != umemp->size.
>
> You may use 'dpdk_mp_sweep' as a reference.
>
Hi Ilya,

Thanks, I can reproduce the issue.

So we can only free the umem_pool, umem->buffer and xpacket_pool
when umemp->index == umemp->size (meaning all the elems we popped have
been pushed back to the umem pool).

One extra thing to work on is reclaiming the umem memory held on the queues
when destroying the xsk. E.g.: reclaim the umem elems on the fill queue and
make sure the elems in the rx queue are all processed; likewise, reclaim the
umem elems on the tx queue and make sure the elems in the completion queue
are all processed.

I'm working on an idea similar to dpdk_mp_sweep.
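
Roughly, the deferred free could look like the sketch below. This is not
the actual patch code: pool_sketch and the pool_*() helpers are
hypothetical names, the pool is a plain LIFO stack, and locking is left
out (single-threaded assumption). A retired pool is only freed by the
sweep once index == size, i.e. once every popped elem has been pushed
back.

```c
#include <assert.h>
#include <stdlib.h>

#define SKETCH_ELEM_BYTES 64

struct pool_sketch {
    struct pool_sketch *next;  /* Link in the list of retired pools. */
    char *buffer;              /* Backing memory, stands in for umem->buffer. */
    void **elems;              /* LIFO stack of free elements. */
    unsigned int index;        /* Number of elements currently on the stack. */
    unsigned int size;         /* Total number of elements in the pool. */
};

static struct pool_sketch *retired;  /* Pools waiting to be freed. */

static struct pool_sketch *
pool_create(unsigned int size)
{
    struct pool_sketch *p = malloc(sizeof *p);
    unsigned int i;

    assert(p);
    p->buffer = malloc((size_t) size * SKETCH_ELEM_BYTES);
    p->elems = malloc(size * sizeof *p->elems);
    assert(p->buffer && p->elems);
    for (i = 0; i < size; i++) {
        p->elems[i] = p->buffer + (size_t) i * SKETCH_ELEM_BYTES;
    }
    p->index = size;
    p->size = size;
    p->next = NULL;
    return p;
}

static void *
pool_pop(struct pool_sketch *p)
{
    return p->index ? p->elems[--p->index] : NULL;
}

static void
pool_push(struct pool_sketch *p, void *elem)
{
    assert(p->index < p->size);
    p->elems[p->index++] = elem;
}

/* Called instead of freeing the pool directly when its owner goes away. */
static void
pool_retire(struct pool_sketch *p)
{
    p->next = retired;
    retired = p;
}

/* Frees every retired pool whose elements have all been returned and
 * reports how many pools were freed.  Pools with elements still in flight
 * stay on the list for a later sweep. */
static unsigned int
pool_sweep(void)
{
    struct pool_sketch **pp = &retired;
    unsigned int freed = 0;

    while (*pp) {
        struct pool_sketch *p = *pp;

        if (p->index == p->size) {
            *pp = p->next;
            free(p->elems);
            free(p->buffer);
            free(p);
            freed++;
        } else {
            pp = &p->next;
        }
    }
    return freed;
}
```

The sweep would have to be called again later (for example on the next
pool creation or destruction) so that pools with delayed packets are
eventually released.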

Regards,
William

<snip>
Ilya Maximets June 27, 2019, 5:07 p.m. UTC | #8
Just a few comments inline.

Best regards, Ilya Maximets.

On 19.06.2019 22:51, William Tu wrote:
> The patch introduces experimental AF_XDP support for OVS netdev.
> AF_XDP, the Address Family of the eXpress Data Path, is a new Linux socket
> type built upon the eBPF and XDP technology.  It aims to have comparable
> performance to DPDK but cooperate better with the existing kernel networking
> stack.  An AF_XDP socket receives and sends packets from an eBPF/XDP program
> attached to the netdev, bypassing a couple of the Linux kernel's subsystems.
> As a result, an AF_XDP socket shows much better performance than AF_PACKET.
> For more details about AF_XDP, please see linux kernel's
> Documentation/networking/af_xdp.rst. Note that by default, this feature is
> not compiled in.
> 
> Signed-off-by: William Tu <u9012063@gmail.com>
> ---
> v1->v2:
> - add a list to maintain unused umem elements
> - remove copy from rx umem to ovs internal buffer
> - use hugetlb to reduce misses (not much difference)
> - use pmd mode netdev in OVS (huge performance improve)
> - remove malloc dp_packet, instead put dp_packet in umem
> 
> v2->v3:
> - rebase on the OVS master, 7ab4b0653784
>   ("configure: Check for more specific function to pull in pthread library.")
> - remove the dependency on libbpf and dpif-bpf.
>   instead, use the built-in XDP_ATTACH feature.
> - data structure optimizations for better performance, see[1]
> - more test cases support
> v3: https://mail.openvswitch.org/pipermail/ovs-dev/2018-November/354179.html
> 
> v3->v4:
> - Use AF_XDP API provided by libbpf
> - Remove the dependency on XDP_ATTACH kernel patch set
> - Add documentation, bpf.rst
> 
> v4->v5:
> - rebase to master
> - remove rfc, squash all into a single patch
> - add --enable-afxdp, so by default, AF_XDP is not compiled
> - add options: xdpmode=drv,skb
> - add multiple queue and multiple PMD support, with options: n_rxq
> - improve documentation, rename bpf.rst to af_xdp.rst
> 
> v5->v6
> - rebase to master, commit 0cdd5b13de91b98
> - address errors from sparse and clang
> - pass travis-ci test
> - address feedback from Ben
> - fix issues reported by 0-day robot
> - improved documentation
> 
> v6-v7
> - rebase to master, commit abf11558c1515bf3b1
> - address feedbacks from Ilya, Ben, and Eelco, see:
>   https://www.mail-archive.com/ovs-dev@openvswitch.org/msg32357.html
> - add XDP mode change, implement get/set_config, reconfigure
> - Fix reconfiguration/crash issue caused by libbpf, see patch:
>   [PATCH bpf 0/2] libbpf: fixes for AF_XDP teardown
> - perf optimization for batching umem_push/pop
> - perf optimization for batching kick_tx
> - test build with dpdk
> - fix/refactor atomic operation
> - make AF_XDP x86 specific, otherwise fail at build time
> - lots of code refactoring
> - add PVP setup in documentation
> 
> v7-v8:
> - Address feedback from Ilya at:
>   https://patchwork.ozlabs.org/patch/1095019/
> - add netdev-linux-private.h
> - fix afxdp reconfigure issue
> - sort include headers
> - remove unnecessary OVS_UNUSED
> - coding style fixes
> - error case handling and memory leak
> 
> v8-v9:
> - rebase to master 180bbbed3a3867d52
> - Address review feedback from Ben, Ilya and Eelco, at:
>   https://patchwork.ozlabs.org/patch/1097740/
> - == From Ilya ==
> - Optimize the reconfiguration logic
> - Implement .rxq_recv and .send for afxdp
> - Remove system-afxdp-traffic.at, reuse existing code
> - Use Ilya's rdtsc code
> - remove --disable-system
> - == From Eelco ==
> - Fix bug when remove br0, util(revalidator49)|EMER|lib/poll-loop.c:111:
>   assertion !fd != !wevent failed
> - Fix bug and use default value from libbpf, ex: XSK_RING_PROD__DEFAULT...
> - Clear xdp program when receive signal, ctrl+c
> - Add options to vswitch.xml, set xdpmode default to skb-mode
> - No support for ARM and PPC, now x86_64 only
> - remove redundant header includes and function/macro definitions
> - remove some ifdef HAVE_AF_XDP
> - == From others/both about afxdp rx and tx ==
> - Several umem push/pop error handling improvement/fixes
> - add lock to address concurrent_txq case
> - improve error handling
> - add stats
> - Things that are not done yet
> - MTU limitation
> - n_txq_desc/n_rxq_desc option.
> 
> v9-v10
> - remove x86_64 limitation, suggested by Ben and Eelco
> - add xmalloc_pagealign, free_pagealign
> - minor refactor
> 
> v10-v11
> - address feedback from Ilya at
>   https://patchwork.ozlabs.org/patch/1106495/
> - fix typos, and some refactoring
> - refactor existing code and introduce xmalloc pagealign
> - fix a couple of error handling case
> - allocate per-txq lock
> - dynamic allocate xsk array
> - fix cycle_counter_update() for non-x86/non-linux case
> 
> v11-v12
> - mainly address a couple of crashes reported by Eelco
>   https://patchwork.ozlabs.org/patch/1110729/
> - fix cleanup xdp program problem when ovs-vswitchd restarts
> - following cases should remove xdp program
>   - kill `pidof ovs-vswitchd`
>   - ovs-appctl -t ovs-vswitchd exit --cleanup
>   - note: ovs-ctl restart does not have "--cleanup" so still an issue
> - work around issues of xsk_ring_cons__peek at libbpf, reported at
>   https://marc.info/?l=xdp-newbies&m=156055471727857&w=2
> - variable name refactoring
> - there is some performance degradation, but let's make sure
>   everything works first
> 
> v12-v13
> - rebase to master
> - add coverage counter afxdp_cq_empty, afxdp_fq_full
> - minor refactoring
> ---
>  Documentation/automake.mk             |   1 +
>  Documentation/index.rst               |   1 +
>  Documentation/intro/install/afxdp.rst | 425 ++++++++++++++++
>  Documentation/intro/install/index.rst |   1 +
>  acinclude.m4                          |  35 ++
>  configure.ac                          |   1 +
>  lib/automake.mk                       |  14 +
>  lib/dp-packet.c                       |  28 ++
>  lib/dp-packet.h                       |  18 +-
>  lib/dpif-netdev-perf.h                |  26 +
>  lib/netdev-afxdp.c                    | 891 ++++++++++++++++++++++++++++++++++
>  lib/netdev-afxdp.h                    |  74 +++
>  lib/netdev-linux-private.h            | 138 ++++++
>  lib/netdev-linux.c                    | 121 ++---
>  lib/netdev-provider.h                 |   3 +
>  lib/netdev.c                          |  11 +
>  lib/spinlock.h                        |  70 +++
>  lib/util.c                            |  92 +++-
>  lib/util.h                            |   5 +
>  lib/xdpsock.c                         | 170 +++++++
>  lib/xdpsock.h                         | 101 ++++
>  tests/automake.mk                     |  16 +
>  tests/system-afxdp-macros.at          |  20 +
>  tests/system-afxdp-testsuite.at       |  26 +
>  vswitchd/vswitch.xml                  |  30 ++
>  25 files changed, 2210 insertions(+), 108 deletions(-)
>  create mode 100644 Documentation/intro/install/afxdp.rst
>  create mode 100644 lib/netdev-afxdp.c
>  create mode 100644 lib/netdev-afxdp.h
>  create mode 100644 lib/netdev-linux-private.h
>  create mode 100644 lib/spinlock.h
>  create mode 100644 lib/xdpsock.c
>  create mode 100644 lib/xdpsock.h
>  create mode 100644 tests/system-afxdp-macros.at
>  create mode 100644 tests/system-afxdp-testsuite.at
> 
> diff --git a/Documentation/automake.mk b/Documentation/automake.mk
> index 082438e09a33..11cc59efc881 100644
> --- a/Documentation/automake.mk
> +++ b/Documentation/automake.mk
> @@ -10,6 +10,7 @@ DOC_SOURCE = \
>  	Documentation/intro/why-ovs.rst \
>  	Documentation/intro/install/index.rst \
>  	Documentation/intro/install/bash-completion.rst \
> +	Documentation/intro/install/afxdp.rst \
>  	Documentation/intro/install/debian.rst \
>  	Documentation/intro/install/documentation.rst \
>  	Documentation/intro/install/distributions.rst \
> diff --git a/Documentation/index.rst b/Documentation/index.rst
> index 46261235c732..aa9e7c49f179 100644
> --- a/Documentation/index.rst
> +++ b/Documentation/index.rst
> @@ -59,6 +59,7 @@ vSwitch? Start here.
>    :doc:`intro/install/windows` |
>    :doc:`intro/install/xenserver` |
>    :doc:`intro/install/dpdk` |
> +  :doc:`intro/install/afxdp` |
>    :doc:`Installation FAQs <faq/releases>`
>  
>  - **Tutorials:** :doc:`tutorials/faucet` |
> diff --git a/Documentation/intro/install/afxdp.rst b/Documentation/intro/install/afxdp.rst
> new file mode 100644
> index 000000000000..291df8d45020
> --- /dev/null
> +++ b/Documentation/intro/install/afxdp.rst
> @@ -0,0 +1,425 @@
> +..
> +      Licensed under the Apache License, Version 2.0 (the "License"); you may
> +      not use this file except in compliance with the License. You may obtain
> +      a copy of the License at
> +
> +          http://www.apache.org/licenses/LICENSE-2.0
> +
> +      Unless required by applicable law or agreed to in writing, software
> +      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
> +      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
> +      License for the specific language governing permissions and limitations
> +      under the License.
> +
> +      Convention for heading levels in Open vSwitch documentation:
> +
> +      =======  Heading 0 (reserved for the title in a document)
> +      -------  Heading 1
> +      ~~~~~~~  Heading 2
> +      +++++++  Heading 3
> +      '''''''  Heading 4
> +
> +      Avoid deeper levels because they do not render well.
> +
> +
> +========================
> +Open vSwitch with AF_XDP
> +========================
> +
> +This document describes how to build and install Open vSwitch using
> +AF_XDP netdev.
> +
> +.. warning::
> +  The AF_XDP support of Open vSwitch is considered 'experimental',
> +  and it is not compiled in by default.
> +
> +
> +Introduction
> +------------
> +AF_XDP, Address Family of the eXpress Data Path, is a new Linux socket type
> +built upon the eBPF and XDP technology.  It aims to have comparable
> +performance to DPDK but cooperate better with the existing kernel networking
> +stack.  An AF_XDP socket receives and sends packets from an eBPF/XDP program
> +attached to the netdev, bypassing a couple of the Linux kernel's subsystems.
> +As a result, an AF_XDP socket shows much better performance than AF_PACKET.
> +For more details about AF_XDP, please see the Linux kernel's
> +Documentation/networking/af_xdp.rst.
> +
> +
> +AF_XDP Netdev
> +-------------
> +OVS has several netdev types, e.g., system, tap, or
> +dpdk.  The AF_XDP feature adds a new netdev type called
> +"afxdp", and implements its configuration, packet reception,
> +and transmit functions.  Since the AF_XDP socket, called xsk,
> +operates in userspace, once ovs-vswitchd receives packets
> +from xsk, the afxdp netdev re-uses the existing userspace
> +dpif-netdev datapath.  As a result, most of the packet processing
> +happens in userspace instead of in the Linux kernel.
> +
> +::
> +
> +              |   +-------------------+
> +              |   |    ovs-vswitchd   |<-->ovsdb-server
> +              |   +-------------------+
> +              |   |      ofproto      |<-->OpenFlow controllers
> +              |   +--------+-+--------+
> +              |   | netdev | |ofproto-|
> +    userspace |   +--------+ |  dpif  |
> +              |   | afxdp  | +--------+
> +              |   | netdev | |  dpif  |
> +              |   +---||---+ +--------+
> +              |       ||     |  dpif- |
> +              |       ||     | netdev |
> +              |_      ||     +--------+
> +                      ||
> +               _  +---||-----+--------+
> +              |   | AF_XDP prog +     |
> +       kernel |   |   xsk_map         |
> +              |_  +--------||---------+
> +                           ||
> +                        physical
> +                           NIC
> +
> +
> +Build requirements
> +------------------
> +
> +In addition to the requirements described in :doc:`general`, building Open
> +vSwitch with AF_XDP will require the following:
> +
> +- libbpf from kernel source tree (kernel 5.0.0 or later)
> +
> +- Linux kernel XDP support, with the following options (required)
> +
> +  * CONFIG_BPF=y
> +
> +  * CONFIG_BPF_SYSCALL=y
> +
> +  * CONFIG_XDP_SOCKETS=y
> +
> +
> +- The following optional Kconfig options are also recommended, but not
> +  required:
> +
> +  * CONFIG_BPF_JIT=y (Performance)
> +
> +  * CONFIG_HAVE_BPF_JIT=y (Performance)
> +
> +  * CONFIG_XDP_SOCKETS_DIAG=y (Debugging)
> +
> +- Once your AF_XDP-enabled kernel is ready, if possible, run
> +  **./xdpsock -r -N -z -i <your device>** under linux/samples/bpf.
> +  This is an OVS-independent benchmark tool for AF_XDP.
> +  It makes sure your basic kernel requirements are met for AF_XDP.
> +
> +
> +Installing
> +----------
> +For OVS to use AF_XDP netdev, it has to be configured with LIBBPF support.
> +First, clone a recent version of Linux bpf-next tree::
> +
> +  git clone git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git
> +
> +Second, go into the Linux source directory and build libbpf in the tools
> +directory::
> +
> +  cd bpf-next/
> +  cd tools/lib/bpf/
> +  make && make install
> +  make install_headers
> +
> +.. note::
> +   Make sure xsk.h and bpf.h are installed in the system's include path,
> +   e.g. /usr/local/include/bpf/ or /usr/include/bpf/
> +
> +Make sure the libbpf.so is installed correctly::
> +
> +  ldconfig
> +  ldconfig -p | grep libbpf
> +
> +Third, ensure the standard OVS requirements are installed and
> +bootstrap/configure the package::
> +
> +  ./boot.sh && ./configure --enable-afxdp
> +
> +Finally, build and install OVS::
> +
> +  make && make install
> +
> +To kick start end-to-end autotesting::
> +
> +  uname -a # make sure the kernel is 5.0 or later
> +  make check-afxdp TESTSUITEFLAGS='1'
> +
> +If a test case fails, check the log at::
> +
> +  cat tests/system-afxdp-testsuite.dir/001/system-afxdp-testsuite.log
> +
> +
> +Setup AF_XDP netdev
> +-------------------
> +Before running OVS with AF_XDP, make sure that libbpf and libelf are
> +set up correctly::
> +
> +  ldd vswitchd/ovs-vswitchd
> +
> +Open vSwitch should be started using userspace datapath as described
> +in :doc:`general`::
> +
> +  ovs-vswitchd ...
> +  ovs-vsctl -- add-br br0 -- set Bridge br0 datapath_type=netdev
> +
> +Make sure your device driver supports AF_XDP.  To use 1 PMD (on core 4)
> +on a 1-queue (queue 0) device, configure these options: **pmd-cpu-mask,
> +pmd-rxq-affinity, and n_rxq**. The **xdpmode** can be "drv" or "skb"::
> +
> +  ethtool -L enp2s0 combined 1
> +  ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x10
> +  ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="afxdp" \
> +    options:n_rxq=1 options:xdpmode=drv \
> +    other_config:pmd-rxq-affinity="0:4"
> +
> +Or, use 4 pmds/cores and 4 queues by doing::
> +
> +  ethtool -L enp2s0 combined 4
> +  ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x1e
> +  ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="afxdp" \
> +    options:n_rxq=4 options:xdpmode=drv \
> +    other_config:pmd-rxq-affinity="0:1,1:2,2:3,3:4"
> +
> +.. note::
> +   pmd-rxq-affinity is optional. If not specified, system will auto-assign.
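
A side note on the two examples above: as far as I understand, pmd-cpu-mask
has to contain every core that pmd-rxq-affinity names, so "0:4" needs bit 4
(0x10) and "0:1,1:2,2:3,3:4" needs bits 1 to 4 (0x1e).  Below is a minimal
sketch for checking such a pair by hand.  The helper name
affinity_to_cpu_mask() is hypothetical and not part of OVS; it only parses
the "<queue>:<core>" list format shown in the documentation.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper, not an OVS function.  Returns a bitmask with one bit
 * set for each core id named in a "<queue>:<core>,<queue>:<core>,..." string,
 * or 0 if the string is malformed or names a core id of 64 or more. */
uint64_t
affinity_to_cpu_mask(const char *affinity)
{
    uint64_t mask = 0;
    const char *p = affinity;

    while (*p) {
        char *end;
        unsigned long core;

        (void) strtoul(p, &end, 10);    /* Queue id, value not needed. */
        if (end == p || *end != ':') {
            return 0;
        }
        p = end + 1;

        core = strtoul(p, &end, 10);    /* Core id. */
        if (end == p || core >= 64) {
            return 0;
        }
        mask |= UINT64_C(1) << core;

        if (*end == ',') {
            p = end + 1;
        } else if (*end == '\0') {
            p = end;
        } else {
            return 0;
        }
    }
    return mask;
}
```

Comparing the returned value with the configured pmd-cpu-mask shows whether
any pinned core is missing from the mask.
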
> +
> +To validate that the bridge has successfully instantiated, you can use::
> +
> +  ovs-vsctl show
> +
> +It should show something like::
> +
> +  Port "ens802f0"
> +   Interface "ens802f0"
> +      type: afxdp
> +      options: {n_rxq="1", xdpmode=drv}
> +
> +Otherwise, enable debugging by::
> +
> +  ovs-appctl vlog/set netdev_afxdp::dbg
> +
> +
> +References
> +----------
> +Most of the design details are described in the paper presented at the
> +Linux Plumbers Conference 2018, "Bringing the Power of eBPF to Open
> +vSwitch"[1], section 4, and slides[2][4].
> +"The Path to DPDK Speeds for AF XDP"[3] gives a very good introduction
> +to current and future AF_XDP work.
> +
> +[1] http://vger.kernel.org/lpc_net2018_talks/ovs-ebpf-afxdp.pdf
> +
> +[2] http://vger.kernel.org/lpc_net2018_talks/ovs-ebpf-lpc18-presentation.pdf
> +
> +[3] http://vger.kernel.org/lpc_net2018_talks/lpc18_paper_af_xdp_perf-v2.pdf
> +
> +[4] https://ovsfall2018.sched.com/event/IO7p/fast-userspace-ovs-with-afxdp
> +
> +
> +Performance Tuning
> +------------------
> +The name of the game is to keep your CPU running in userspace, allowing the
> +PMD threads to keep polling the AF_XDP queues without any interference from
> +the kernel.
> +#. Make sure everything is in the same NUMA node (memory used by AF_XDP, pmd
> +   running cores, device plug-in slot)
> +
> +#. Isolate your CPUs using isolcpus in the GRUB configuration.
> +
> +#. IRQs should not be assigned to the cores running PMD threads.
> +
> +#. The Spectre and Meltdown fixes increase the overhead of system calls.
> +
> +
> +Debugging performance issue
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +While running the traffic, use the Linux perf tool to see where your CPU
> +spends its cycles::
> +
> +  cd bpf-next/tools/perf
> +  make
> +  ./perf record -p `pidof ovs-vswitchd` sleep 10
> +  ./perf report
> +
> +Measure your system call rate by doing::
> +
> +  pstree -p `pidof ovs-vswitchd`
> +  strace -c -p <your pmd's PID>
> +
> +Or, use OVS pmd tool::
> +
> +  ovs-appctl dpif-netdev/pmd-stats-show
> +
> +
> +Example Script
> +--------------
> +
> +Below is a script using namespaces and veth peer::
> +
> +  #!/bin/bash
> +  ovs-vswitchd --no-chdir --pidfile -vvconn -vofproto_dpif -vunixctl \
> +    --disable-system --detach
> +  ovs-vsctl -- add-br br0 -- set Bridge br0 \
> +    protocols=OpenFlow10,OpenFlow11,OpenFlow12,OpenFlow13,OpenFlow14 \
> +    fail-mode=secure datapath_type=netdev
> +  ovs-vsctl -- add-br br0 -- set Bridge br0 datapath_type=netdev
> +
> +  ip netns add at_ns0
> +  ovs-appctl vlog/set netdev_afxdp::dbg
> +
> +  ip link add p0 type veth peer name afxdp-p0
> +  ip link set p0 netns at_ns0
> +  ip link set dev afxdp-p0 up
> +  ovs-vsctl add-port br0 afxdp-p0 -- \
> +    set interface afxdp-p0 external-ids:iface-id="p0" type="afxdp"
> +
> +  ip netns exec at_ns0 sh << NS_EXEC_HEREDOC
> +  ip addr add "10.1.1.1/24" dev p0
> +  ip link set dev p0 up
> +  NS_EXEC_HEREDOC
> +
> +  ip netns add at_ns1
> +  ip link add p1 type veth peer name afxdp-p1
> +  ip link set p1 netns at_ns1
> +  ip link set dev afxdp-p1 up
> +
> +  ovs-vsctl add-port br0 afxdp-p1 -- \
> +    set interface afxdp-p1 external-ids:iface-id="p1" type="afxdp"
> +  ip netns exec at_ns1 sh << NS_EXEC_HEREDOC
> +  ip addr add "10.1.1.2/24" dev p1
> +  ip link set dev p1 up
> +  NS_EXEC_HEREDOC
> +
> +  ip netns exec at_ns0 ping -i .2 10.1.1.2
> +
> +
> +Limitations/Known Issues
> +------------------------
> +#. Device's NUMA ID is always 0; need a way to find the NUMA ID from a netdev.
> +#. No QoS support because AF_XDP netdev bypasses the Linux TC layer. A possible
> +   work-around is to use OpenFlow meter action.
> +#. Adding an AF_XDP device to a bridge, removing it, and adding it again fails.
> +#. Most of the tests are done using a single i40e port. Multiple ports and
> +   the ixgbe driver also need to be tested.
> +#. No latency test results (TODO item).
> +
> +
> +PVP using tap device
> +--------------------
> +Assume you have enp2s0 as the physical NIC, and a tap device connected to a VM.
> +First, start OVS, then add physical port::
> +
> +  ethtool -L enp2s0 combined 1
> +  ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x10
> +  ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="afxdp" \
> +    options:n_rxq=1 options:xdpmode=drv \
> +    other_config:pmd-rxq-affinity="0:4"
> +
> +Start a VM with virtio and tap device::
> +
> +  qemu-system-x86_64 -hda ubuntu1810.qcow \
> +    -m 4096 \
> +    -cpu host,+x2apic -enable-kvm \
> +    -device virtio-net-pci,mac=00:02:00:00:00:01,netdev=net0,mq=on,\
> +      vectors=10,mrg_rxbuf=on,rx_queue_size=1024 \
> +    -netdev type=tap,id=net0,vhost=on,queues=8 \
> +    -object memory-backend-file,id=mem,size=4096M,\
> +      mem-path=/dev/hugepages,share=on \
> +    -numa node,memdev=mem -mem-prealloc -smp 2
> +
> +Create OpenFlow rules::
> +
> +  ovs-vsctl add-port br0 tap0
> +  ovs-ofctl del-flows br0
> +  ovs-ofctl add-flow br0 "in_port=enp2s0, actions=output:tap0"
> +  ovs-ofctl add-flow br0 "in_port=tap0, actions=output:enp2s0"
> +
> +Inside the VM, use xdp_rxq_info to bounce back the traffic::
> +
> +  ./xdp_rxq_info --dev ens3 --action XDP_TX
> +
> +
> +PVP using vhostuser device
> +--------------------------
> +First, build OVS with DPDK and AFXDP::
> +
> +  ./configure  --enable-afxdp --with-dpdk=<dpdk path>
> +  make -j4 && make install
> +
> +Create a vhost-user port from OVS::
> +
> +  ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-init=true
> +  ovs-vsctl -- add-br br0 -- set Bridge br0 datapath_type=netdev \
> +    other_config:pmd-cpu-mask=0xfff
> +  ovs-vsctl add-port br0 vhost-user-1 \
> +    -- set Interface vhost-user-1 type=dpdkvhostuser
> +
> +Start VM using vhost-user mode::
> +
> +  qemu-system-x86_64 -hda ubuntu1810.qcow \
> +   -m 4096 \
> +   -cpu host,+x2apic -enable-kvm \
> +   -chardev socket,id=char1,path=/usr/local/var/run/openvswitch/vhost-user-1 \
> +   -netdev type=vhost-user,id=mynet1,chardev=char1,vhostforce,queues=4 \
> +   -device virtio-net-pci,mac=00:00:00:00:00:01,\
> +      netdev=mynet1,mq=on,vectors=10 \
> +   -object memory-backend-file,id=mem,size=4096M,\
> +      mem-path=/dev/hugepages,share=on \
> +   -numa node,memdev=mem -mem-prealloc -smp 2
> +
> +Setup the OpenFlow rules::
> +
> +  ovs-ofctl del-flows br0
> +  ovs-ofctl add-flow br0 "in_port=enp2s0, actions=output:vhost-user-1"
> +  ovs-ofctl add-flow br0 "in_port=vhost-user-1, actions=output:enp2s0"
> +
> +Inside the VM, use xdp_rxq_info to drop or bounce back the traffic::
> +
> +  ./xdp_rxq_info --dev ens3 --action XDP_DROP
> +  ./xdp_rxq_info --dev ens3 --action XDP_TX
> +
> +
> +PCP container using veth
> +------------------------
> +Create namespace and veth peer devices::
> +
> +  ip netns add at_ns0
> +  ip link add p0 type veth peer name afxdp-p0
> +  ip link set p0 netns at_ns0
> +  ip link set dev afxdp-p0 up
> +  ip netns exec at_ns0 ip link set dev p0 up
> +
> +Attach the veth port to br0 (linux kernel mode)::
> +
> +  ovs-vsctl add-port br0 afxdp-p0 -- \
> +    set interface afxdp-p0 options:n_rxq=1
> +
> +Or, use AF_XDP with skb mode::
> +
> +  ovs-vsctl add-port br0 afxdp-p0 -- \
> +    set interface afxdp-p0 type="afxdp" options:n_rxq=1 options:xdpmode=skb
> +
> +Setup the OpenFlow rules::
> +
> +  ovs-ofctl del-flows br0
> +  ovs-ofctl add-flow br0 "in_port=enp2s0, actions=output:afxdp-p0"
> +  ovs-ofctl add-flow br0 "in_port=afxdp-p0, actions=output:enp2s0"
> +
> +In the namespace, drop or bounce back the packets::
> +
> +  ip netns exec at_ns0 ./xdp_rxq_info --dev p0 --action XDP_DROP
> +  ip netns exec at_ns0 ./xdp_rxq_info --dev p0 --action XDP_TX
> +
> +
> +Bug Reporting
> +-------------
> +
> +Please report problems to dev@openvswitch.org.
> diff --git a/Documentation/intro/install/index.rst b/Documentation/intro/install/index.rst
> index 3193c736cf17..c27a9c9d16ff 100644
> --- a/Documentation/intro/install/index.rst
> +++ b/Documentation/intro/install/index.rst
> @@ -45,6 +45,7 @@ Installation from Source
>     xenserver
>     userspace
>     dpdk
> +   afxdp
>  
>  Installation from Packages
>  --------------------------
> diff --git a/acinclude.m4 b/acinclude.m4
> index 321a741985db..bb03b504a2a8 100644
> --- a/acinclude.m4
> +++ b/acinclude.m4
> @@ -238,6 +238,41 @@ AC_DEFUN([OVS_FIND_DEPENDENCY], [
>    ])
>  ])
>  
> +dnl OVS_CHECK_LINUX_AF_XDP
> +dnl
> +dnl Check both Linux kernel AF_XDP and libbpf support
> +AC_DEFUN([OVS_CHECK_LINUX_AF_XDP], [
> +  AC_ARG_ENABLE([afxdp],
> +                [AC_HELP_STRING([--enable-afxdp], [Enable AF_XDP support])],
> +                [], [enable_afxdp=no])
> +  AC_MSG_CHECKING([whether AF_XDP is enabled])
> +  if test "$enable_afxdp" != yes; then
> +    AC_MSG_RESULT([no])
> +    AF_XDP_ENABLE=false
> +  else
> +    AC_MSG_RESULT([yes])
> +    AF_XDP_ENABLE=true
> +
> +    AC_CHECK_HEADER([bpf/libbpf.h], [],
> +      [AC_MSG_ERROR([unable to find bpf/libbpf.h for AF_XDP support])])
> +
> +    AC_CHECK_HEADER([linux/if_xdp.h], [],
> +      [AC_MSG_ERROR([unable to find linux/if_xdp.h for AF_XDP support])])
> +
> +    AC_CHECK_HEADER([bpf/xsk.h], [],
> +      [AC_MSG_ERROR([unable to find bpf/xsk.h for AF_XDP support])])
> +
> +    AC_CHECK_HEADER([bpf/libbpf_util.h], [],
> +      [AC_MSG_ERROR([unable to find bpf/libbpf_util.h for AF_XDP support])])
> +
> +    AC_DEFINE([HAVE_AF_XDP], [1],
> +              [Define to 1 if AF_XDP support is available and enabled.])
> +    LIBBPF_LDADD=" -lbpf -lelf"
> +    AC_SUBST([LIBBPF_LDADD])
> +  fi
> +  AM_CONDITIONAL([HAVE_AF_XDP], test "$AF_XDP_ENABLE" = true)
> +])
> +
>  dnl OVS_CHECK_DPDK
>  dnl
>  dnl Configure DPDK source tree
> diff --git a/configure.ac b/configure.ac
> index a9f0a06dc140..36ad246203db 100644
> --- a/configure.ac
> +++ b/configure.ac
> @@ -98,6 +98,7 @@ OVS_CHECK_SPHINX
>  OVS_CHECK_DOT
>  OVS_CHECK_IF_DL
>  OVS_CHECK_STRTOK_R
> +OVS_CHECK_LINUX_AF_XDP
>  AC_CHECK_DECLS([sys_siglist], [], [], [[#include <signal.h>]])
>  AC_CHECK_MEMBERS([struct stat.st_mtim.tv_nsec, struct stat.st_mtimensec],
>    [], [], [[#include <sys/stat.h>]])
> diff --git a/lib/automake.mk b/lib/automake.mk
> index 1b89cac8c3a2..9b75e47ba396 100644
> --- a/lib/automake.mk
> +++ b/lib/automake.mk
> @@ -14,6 +14,10 @@ if WIN32
>  lib_libopenvswitch_la_LIBADD += ${PTHREAD_LIBS}
>  endif
>  
> +if HAVE_AF_XDP
> +lib_libopenvswitch_la_LIBADD += $(LIBBPF_LDADD)
> +endif
> +
>  lib_libopenvswitch_la_LDFLAGS = \
>          $(OVS_LTINFO) \
>          -Wl,--version-script=$(top_builddir)/lib/libopenvswitch.sym \
> @@ -394,6 +398,7 @@ lib_libopenvswitch_la_SOURCES += \
>  	lib/if-notifier.h \
>  	lib/netdev-linux.c \
>  	lib/netdev-linux.h \
> +	lib/netdev-linux-private.h \
>  	lib/netdev-offload-tc.c \
>  	lib/netlink-conntrack.c \
>  	lib/netlink-conntrack.h \
> @@ -410,6 +415,15 @@ lib_libopenvswitch_la_SOURCES += \
>  	lib/tc.h
>  endif
>  
> +if HAVE_AF_XDP
> +lib_libopenvswitch_la_SOURCES += \
> +	lib/xdpsock.c \
> +	lib/xdpsock.h \
> +	lib/netdev-afxdp.c \
> +	lib/netdev-afxdp.h \
> +	lib/spinlock.h
> +endif
> +
>  if DPDK_NETDEV
>  lib_libopenvswitch_la_SOURCES += \
>  	lib/dpdk.c \
> diff --git a/lib/dp-packet.c b/lib/dp-packet.c
> index 0976a35e758b..e6a7947076b4 100644
> --- a/lib/dp-packet.c
> +++ b/lib/dp-packet.c
> @@ -19,6 +19,7 @@
>  #include <string.h>
>  
>  #include "dp-packet.h"
> +#include "netdev-afxdp.h"
>  #include "netdev-dpdk.h"
>  #include "openvswitch/dynamic-string.h"
>  #include "util.h"
> @@ -59,6 +60,27 @@ dp_packet_use(struct dp_packet *b, void *base, size_t allocated)
>      dp_packet_use__(b, base, allocated, DPBUF_MALLOC);
>  }
>  
> +#if HAVE_AF_XDP
> +/* Initializes 'b' as an empty dp_packet that contains the 'allocated' bytes
> + * of memory starting at 'base', which points into the AF_XDP umem.
> + */
> +void
> +dp_packet_use_afxdp(struct dp_packet *b, void *base, size_t allocated)
> +{
> +    dp_packet_set_base(b, base);
> +    dp_packet_set_data(b, base);
> +    dp_packet_set_size(b, 0);
> +
> +    dp_packet_set_allocated(b, allocated);
> +    b->source = DPBUF_AFXDP;
> +    dp_packet_reset_offsets(b);
> +    pkt_metadata_init(&b->md, 0);
> +    dp_packet_reset_cutlen(b);
> +    dp_packet_reset_offload(b);
> +    b->packet_type = htonl(PT_ETH);
> +}
> +#endif
> +
>  /* Initializes 'b' as an empty dp_packet that contains the 'allocated' bytes of
>   * memory starting at 'base'.  'base' should point to a buffer on the stack.
>   * (Nothing actually relies on 'base' being allocated on the stack.  It could
> @@ -122,6 +144,8 @@ dp_packet_uninit(struct dp_packet *b)
>               * created as a dp_packet */
>              free_dpdk_buf((struct dp_packet*) b);
>  #endif
> +        } else if (b->source == DPBUF_AFXDP) {
> +            free_afxdp_buf(b);
>          }
>      }
>  }
> @@ -248,6 +272,9 @@ dp_packet_resize__(struct dp_packet *b, size_t new_headroom, size_t new_tailroom
>      case DPBUF_STACK:
>          OVS_NOT_REACHED();
>  
> +    case DPBUF_AFXDP:
> +        OVS_NOT_REACHED();
> +
>      case DPBUF_STUB:
>          b->source = DPBUF_MALLOC;
>          new_base = xmalloc(new_allocated);
> @@ -433,6 +460,7 @@ dp_packet_steal_data(struct dp_packet *b)
>  {
>      void *p;
>      ovs_assert(b->source != DPBUF_DPDK);
> +    ovs_assert(b->source != DPBUF_AFXDP);
>  
>      if (b->source == DPBUF_MALLOC && dp_packet_data(b) == dp_packet_base(b)) {
>          p = dp_packet_data(b);
> diff --git a/lib/dp-packet.h b/lib/dp-packet.h
> index a5e9ade1244a..e3438226e360 100644
> --- a/lib/dp-packet.h
> +++ b/lib/dp-packet.h
> @@ -25,6 +25,7 @@
>  #include <rte_mbuf.h>
>  #endif
>  
> +#include "netdev-afxdp.h"
>  #include "netdev-dpdk.h"
>  #include "openvswitch/list.h"
>  #include "packets.h"
> @@ -42,6 +43,7 @@ enum OVS_PACKED_ENUM dp_packet_source {
>      DPBUF_DPDK,                /* buffer data is from DPDK allocated memory.
>                                  * ref to dp_packet_init_dpdk() in dp-packet.c.
>                                  */
> +    DPBUF_AFXDP,               /* buffer data from XDP frame */
>  };
>  
>  #define DP_PACKET_CONTEXT_SIZE 64
> @@ -89,6 +91,13 @@ struct dp_packet {
>      };
>  };
>  
> +#if HAVE_AF_XDP
> +struct dp_packet_afxdp {
> +    struct umem_pool *mpool;
> +    struct dp_packet packet;
> +};
> +#endif
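
For readers following along: embedding the dp_packet inside a wrapper that
also holds the pool pointer suggests that the free path recovers the wrapper
from the packet pointer with a container-of step.  free_afxdp_buf() itself is
not in this part of the patch, so the sketch below is only my assumption of
the pattern.  All struct and function names in it are hypothetical stand-ins,
not the OVS types.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct umem_pool and struct dp_packet. */
struct pool {
    int n_free;
};

struct pkt {
    int source;
};

/* Same layout idea as struct dp_packet_afxdp: owner pool, then the packet. */
struct pkt_wrapper {
    struct pool *mpool;
    struct pkt packet;
};

/* Recovers the wrapper from a pointer to its embedded 'packet' member by
 * subtracting the member's offset inside the wrapper. */
struct pkt_wrapper *
wrapper_from_pkt(struct pkt *p)
{
    return (struct pkt_wrapper *) ((char *) p
                                   - offsetof(struct pkt_wrapper, packet));
}

/* Returns the pool that owns 'p', i.e. where its buffer would be returned. */
struct pool *
pkt_owner_pool(struct pkt *p)
{
    return wrapper_from_pkt(p)->mpool;
}
```

This only works for packets that really live inside a wrapper, which is why
the DPBUF_AFXDP source tag has to be checked before taking this path.
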
> +
>  static inline void *dp_packet_data(const struct dp_packet *);
>  static inline void dp_packet_set_data(struct dp_packet *, void *);
>  static inline void *dp_packet_base(const struct dp_packet *);
> @@ -122,7 +131,9 @@ static inline const void *dp_packet_get_nd_payload(const struct dp_packet *);
>  void dp_packet_use(struct dp_packet *, void *, size_t);
>  void dp_packet_use_stub(struct dp_packet *, void *, size_t);
>  void dp_packet_use_const(struct dp_packet *, const void *, size_t);
> -
> +#if HAVE_AF_XDP
> +void dp_packet_use_afxdp(struct dp_packet *, void *, size_t);
> +#endif
>  void dp_packet_init_dpdk(struct dp_packet *);
>  
>  void dp_packet_init(struct dp_packet *, size_t);
> @@ -184,6 +195,11 @@ dp_packet_delete(struct dp_packet *b)
>              return;
>          }
>  
> +        if (b->source == DPBUF_AFXDP) {
> +            free_afxdp_buf(b);
> +            return;
> +        }
> +
>          dp_packet_uninit(b);
>          free(b);
>      }
> diff --git a/lib/dpif-netdev-perf.h b/lib/dpif-netdev-perf.h
> index 859c05613ddf..6b6dfda7db1c 100644
> --- a/lib/dpif-netdev-perf.h
> +++ b/lib/dpif-netdev-perf.h
> @@ -21,6 +21,7 @@
>  #include <stddef.h>
>  #include <stdint.h>
>  #include <string.h>
> +#include <time.h>
>  #include <math.h>
>  
>  #ifdef DPDK_NETDEV
> @@ -186,6 +187,24 @@ struct pmd_perf_stats {
>      char *log_reason;
>  };
>  
> +#ifdef __linux__
> +static inline uint64_t
> +rdtsc_syscall(struct pmd_perf_stats *s)
> +{
> +    struct timespec val;
> +    uint64_t v;
> +
> +    if (clock_gettime(CLOCK_MONOTONIC_RAW, &val) != 0) {
> +        return s->last_tsc;
> +    }
> +
> +    v  = (uint64_t) val.tv_sec * 1000000000LL;
> +    v += (uint64_t) val.tv_nsec;
> +
> +    return s->last_tsc = v;
> +}
> +#endif
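
One observation on this fallback: it returns nanoseconds from
CLOCK_MONOTONIC_RAW, not TSC cycles, so on non-x86 Linux the "cycle" counters
are really in time units.  A standalone sketch of the same computation is
below, for trying it outside OVS.  The function name monotonic_ns() and its
'fallback' parameter are hypothetical; the parameter plays the role of
s->last_tsc in the patch.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* For CLOCK_MONOTONIC_RAW on glibc. */
#endif
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical standalone version of the fallback.  Returns the raw
 * monotonic clock in nanoseconds, or 'fallback' if the clock cannot be
 * read. */
uint64_t
monotonic_ns(uint64_t fallback)
{
    struct timespec ts;

    if (clock_gettime(CLOCK_MONOTONIC_RAW, &ts) != 0) {
        return fallback;
    }
    return (uint64_t) ts.tv_sec * UINT64_C(1000000000)
           + (uint64_t) ts.tv_nsec;
}
```

Passing the previous reading as 'fallback' keeps the counter from going
backwards when a read fails, which is what the patch does with s->last_tsc.
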
> +
>  /* Support for accurate timing of PMD execution on TSC clock cycle level.
>   * These functions are intended to be invoked in the context of pmd threads. */
>  
> @@ -198,6 +217,13 @@ cycles_counter_update(struct pmd_perf_stats *s)
>  {
>  #ifdef DPDK_NETDEV
>      return s->last_tsc = rte_get_tsc_cycles();
> +#elif !defined(_MSC_VER) && defined(__x86_64__)
> +    uint32_t h, l;
> +    asm volatile("rdtsc" : "=a" (l), "=d" (h));
> +
> +    return s->last_tsc = ((uint64_t) h << 32) | l;
> +#elif defined(__linux__)
> +    return rdtsc_syscall(s);
>  #else
>      return s->last_tsc = 0;
>  #endif
> diff --git a/lib/netdev-afxdp.c b/lib/netdev-afxdp.c
> new file mode 100644
> index 000000000000..33d8612153d5
> --- /dev/null
> +++ b/lib/netdev-afxdp.c
> @@ -0,0 +1,891 @@
> +/*
> + * Copyright (c) 2018, 2019 Nicira, Inc.
> + *
> + * Licensed under the Apache License, Version 2.0 (the "License");
> + * you may not use this file except in compliance with the License.
> + * You may obtain a copy of the License at:
> + *
> + *     http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing, software
> + * distributed under the License is distributed on an "AS IS" BASIS,
> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> + * See the License for the specific language governing permissions and
> + * limitations under the License.
> + */
> +
> +#include <config.h>
> +
> +#include "netdev-linux-private.h"
> +#include "netdev-linux.h"
> +#include "netdev-afxdp.h"
> +
> +#include <errno.h>
> +#include <inttypes.h>
> +#include <linux/rtnetlink.h>
> +#include <linux/if_xdp.h>
> +#include <net/if.h>
> +#include <stdlib.h>
> +#include <sys/resource.h>
> +#include <sys/socket.h>
> +#include <sys/types.h>
> +#include <unistd.h>
> +
> +#include "coverage.h"
> +#include "dp-packet.h"
> +#include "dpif-netdev.h"
> +#include "openvswitch/dynamic-string.h"
> +#include "openvswitch/vlog.h"
> +#include "packets.h"
> +#include "socket-util.h"
> +#include "spinlock.h"
> +#include "util.h"
> +#include "xdpsock.h"
> +
> +#ifndef SOL_XDP
> +#define SOL_XDP 283
> +#endif
> +
> +COVERAGE_DEFINE(afxdp_cq_empty);
> +COVERAGE_DEFINE(afxdp_fq_full);
> +
> +VLOG_DEFINE_THIS_MODULE(netdev_afxdp);
> +static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
> +
> +#define UMEM2DESC(elem, base) ((uint64_t)((char *)elem - (char *)base))
> +#define UMEM2XPKT(base, i) \
> +                  ALIGNED_CAST(struct dp_packet_afxdp *, (char *)base + \
> +                               i * sizeof(struct dp_packet_afxdp))
> +
> +static struct xsk_socket_info *xsk_configure(int ifindex, int xdp_queue_id,
> +                                             int mode);
> +static void xsk_remove_xdp_program(uint32_t ifindex, int xdpmode);
> +static void xsk_destroy(struct xsk_socket_info *xsk);
> +static int xsk_configure_all(struct netdev *netdev);
> +static void xsk_destroy_all(struct netdev *netdev);
> +
> +static struct xsk_umem_info *
> +xsk_configure_umem(void *buffer, uint64_t size, int xdpmode)
> +{
> +    struct xsk_umem_config uconfig OVS_UNUSED;
> +    struct xsk_umem_info *umem;
> +    int ret;
> +    int i;
> +
> +    umem = xcalloc(1, sizeof *umem);
> +    ret = xsk_umem__create(&umem->umem, buffer, size, &umem->fq, &umem->cq,
> +                           NULL);
> +    if (ret) {
> +        VLOG_ERR("xsk_umem__create failed (%s) mode: %s",
> +                 ovs_strerror(errno),
> +                 xdpmode == XDP_COPY ? "SKB": "DRV");
> +        free(umem);
> +        return NULL;
> +    }
> +
> +    umem->buffer = buffer;
> +
> +    /* set-up umem pool */
> +    if (umem_pool_init(&umem->mpool, NUM_FRAMES) < 0) {
> +        VLOG_ERR("umem_pool_init failed");
> +        if (xsk_umem__delete(umem->umem)) {
> +            VLOG_ERR("xsk_umem__delete failed");
> +        }
> +        free(umem);
> +        return NULL;
> +    }
> +
> +    for (i = NUM_FRAMES - 1; i >= 0; i--) {
> +        struct umem_elem *elem;
> +
> +        elem = ALIGNED_CAST(struct umem_elem *,
> +                            (char *)umem->buffer + i * FRAME_SIZE);
> +        umem_elem_push(&umem->mpool, elem);
> +    }
> +
> +    /* set-up metadata */
> +    if (xpacket_pool_init(&umem->xpool, NUM_FRAMES) < 0) {
> +        VLOG_ERR("xpacket_pool_init failed");
> +        umem_pool_cleanup(&umem->mpool);
> +        if (xsk_umem__delete(umem->umem)) {
> +            VLOG_ERR("xsk_umem__delete failed");
> +        }
> +        free(umem);
> +        return NULL;
> +    }
> +
> +    VLOG_DBG("%s xpacket pool from %p to %p", __func__,
> +              umem->xpool.array,
> +              (char *)umem->xpool.array +
> +              NUM_FRAMES * sizeof(struct dp_packet_afxdp));
> +
> +    for (i = NUM_FRAMES - 1; i >= 0; i--) {
> +        struct dp_packet_afxdp *xpacket;
> +        struct dp_packet *packet;
> +
> +        xpacket = UMEM2XPKT(umem->xpool.array, i);
> +        xpacket->mpool = &umem->mpool;
> +
> +        packet = &xpacket->packet;
> +        packet->source = DPBUF_AFXDP;
> +    }
> +
> +    return umem;
> +}
> +
> +static struct xsk_socket_info *
> +xsk_configure_socket(struct xsk_umem_info *umem, uint32_t ifindex,
> +                     uint32_t queue_id, int xdpmode)
> +{
> +    struct xsk_socket_config cfg;
> +    struct xsk_socket_info *xsk;
> +    char devname[IF_NAMESIZE];
> +    uint32_t idx = 0, prog_id;
> +    int ret;
> +    int i;
> +
> +    xsk = xcalloc(1, sizeof(*xsk));
> +    xsk->umem = umem;
> +    cfg.rx_size = CONS_NUM_DESCS;
> +    cfg.tx_size = PROD_NUM_DESCS;
> +    cfg.libbpf_flags = 0;
> +
> +    if (xdpmode == XDP_ZEROCOPY) {
> +        cfg.bind_flags = XDP_ZEROCOPY;
> +        cfg.xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_DRV_MODE;
> +    } else {
> +        cfg.bind_flags = XDP_COPY;
> +        cfg.xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_SKB_MODE;
> +    }
> +
> +    if (if_indextoname(ifindex, devname) == NULL) {
> +        VLOG_ERR("ifindex %d to devname failed (%s)",
> +                 ifindex, ovs_strerror(errno));
> +        free(xsk);
> +        return NULL;
> +    }
> +
> +    ret = xsk_socket__create(&xsk->xsk, devname, queue_id, umem->umem,
> +                             &xsk->rx, &xsk->tx, &cfg);
> +    if (ret) {
> +        VLOG_ERR("xsk_socket__create failed (%s) mode: %s qid: %d",
> +                 ovs_strerror(errno),
> +                 xdpmode == XDP_COPY ? "SKB": "DRV",
> +                 queue_id);
> +        free(xsk);
> +        return NULL;
> +    }
> +
> +    /* Make sure the built-in AF_XDP program is loaded */
> +    ret = bpf_get_link_xdp_id(ifindex, &prog_id, cfg.xdp_flags);
> +    if (ret) {
> +        VLOG_ERR("Get XDP prog ID failed (%s)", ovs_strerror(errno));
> +        xsk_socket__delete(xsk->xsk);
> +        free(xsk);
> +        return NULL;
> +    }
> +
> +    while (!xsk_ring_prod__reserve(&xsk->umem->fq,
> +                                   PROD_NUM_DESCS, &idx)) {
> +        VLOG_WARN_RL(&rl, "Retry xsk_ring_prod__reserve to FILL queue");
> +    }
> +
> +    for (i = 0;
> +         i < PROD_NUM_DESCS * FRAME_SIZE;
> +         i += FRAME_SIZE) {
> +        struct umem_elem *elem;
> +        uint64_t addr;
> +
> +        elem = umem_elem_pop(&xsk->umem->mpool);
> +        addr = UMEM2DESC(elem, xsk->umem->buffer);
> +
> +        *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx++) = addr;
> +    }
> +
> +    xsk_ring_prod__submit(&xsk->umem->fq,
> +                          PROD_NUM_DESCS);
> +    return xsk;
> +}
> +
> +static struct xsk_socket_info *
> +xsk_configure(int ifindex, int xdp_queue_id, int xdpmode)
> +{
> +    struct xsk_socket_info *xsk;
> +    struct xsk_umem_info *umem;
> +    void *bufs;
> +
> +    /* umem memory region */
> +    bufs = xmalloc_pagealign(NUM_FRAMES * FRAME_SIZE);
> +    memset(bufs, 0, NUM_FRAMES * FRAME_SIZE);
> +
> +    /* create AF_XDP socket */
> +    umem = xsk_configure_umem(bufs,
> +                              NUM_FRAMES * FRAME_SIZE,
> +                              xdpmode);
> +    if (!umem) {
> +        free_pagealign(bufs);
> +        return NULL;
> +    }
> +
> +    xsk = xsk_configure_socket(umem, ifindex, xdp_queue_id, xdpmode);
> +    if (!xsk) {
> +        /* clean up umem and xpacket pool */
> +        if (xsk_umem__delete(umem->umem)) {
> +            VLOG_ERR("xsk_umem__delete failed");
> +        }
> +        free_pagealign(bufs);
> +        umem_pool_cleanup(&umem->mpool);
> +        xpacket_pool_cleanup(&umem->xpool);
> +        free(umem);
> +    }
> +    return xsk;
> +}
> +
> +static int
> +xsk_configure_all(struct netdev *netdev)
> +{
> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> +    struct xsk_socket_info *xsk_info;
> +    int i, ifindex, n_rxq;
> +
> +    ifindex = linux_get_ifindex(netdev_get_name(netdev));
> +
> +    n_rxq = netdev_n_rxq(netdev);
> +    dev->xsks = xzalloc(n_rxq * sizeof(struct xsk_socket_info *));
> +
> +    /* configure each queue */
> +    for (i = 0; i < n_rxq; i++) {
> +        VLOG_INFO("%s configure queue %d mode %s", __func__, i,
> +                  dev->xdpmode == XDP_COPY ? "SKB" : "DRV");
> +        xsk_info = xsk_configure(ifindex, i, dev->xdpmode);
> +        if (!xsk_info) {
> +            VLOG_ERR("failed to create AF_XDP socket on queue %d", i);
> +            dev->xsks[i] = NULL;
> +            goto err;
> +        }
> +        dev->xsks[i] = xsk_info;
> +        xsk_info->rx_dropped = 0;
> +        xsk_info->tx_dropped = 0;
> +    }
> +
> +    return 0;
> +
> +err:
> +    xsk_destroy_all(netdev);
> +    return EINVAL;
> +}
> +
> +static void
> +xsk_destroy(struct xsk_socket_info *xsk_info)
> +{
> +    struct xsk_umem *umem;
> +
> +    xsk_socket__delete(xsk_info->xsk);
> +    xsk_info->xsk = NULL;
> +
> +    umem = xsk_info->umem->umem;
> +    if (xsk_umem__delete(umem)) {
> +        VLOG_ERR("xsk_umem__delete failed");
> +    }
> +
> +    /* free the packet buffer */
> +    free_pagealign(xsk_info->umem->buffer);
> +
> +    /* cleanup umem pool */
> +    umem_pool_cleanup(&xsk_info->umem->mpool);
> +
> +    /* cleanup metadata pool */
> +    xpacket_pool_cleanup(&xsk_info->umem->xpool);
> +
> +    free(xsk_info->umem);
> +    free(xsk_info);
> +}
> +
> +static void
> +xsk_destroy_all(struct netdev *netdev)
> +{
> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> +    int i, ifindex;
> +
> +    ifindex = linux_get_ifindex(netdev_get_name(netdev));
> +
> +    for (i = 0; i < netdev_n_rxq(netdev); i++) {
> +        if (dev->xsks && dev->xsks[i]) {
> +            VLOG_INFO("destroy xsk[%d]", i);
> +            xsk_destroy(dev->xsks[i]);
> +            dev->xsks[i] = NULL;
> +        }
> +    }
> +
> +    VLOG_INFO("remove xdp program");
> +    xsk_remove_xdp_program(ifindex, dev->xdpmode);
> +
> +    free(dev->xsks);
> +}
> +
> +static inline void OVS_UNUSED
> +log_xsk_stat(struct xsk_socket_info *xsk OVS_UNUSED) {
> +    struct xdp_statistics stat;
> +    socklen_t optlen;
> +
> +    optlen = sizeof stat;
> +    ovs_assert(getsockopt(xsk_socket__fd(xsk->xsk), SOL_XDP, XDP_STATISTICS,
> +               &stat, &optlen) == 0);
> +
> +    VLOG_DBG_RL(&rl, "rx dropped %llu, rx_invalid %llu, tx_invalid %llu",
> +                stat.rx_dropped,
> +                stat.rx_invalid_descs,
> +                stat.tx_invalid_descs);
> +}
> +
> +int
> +netdev_afxdp_set_config(struct netdev *netdev, const struct smap *args,
> +                        char **errp OVS_UNUSED)
> +{
> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> +    const char *str_xdpmode;
> +    int xdpmode, new_n_rxq;
> +
> +    ovs_mutex_lock(&dev->mutex);
> +    new_n_rxq = MAX(smap_get_int(args, "n_rxq", NR_QUEUE), 1);
> +    if (new_n_rxq > MAX_XSKQ) {
> +        ovs_mutex_unlock(&dev->mutex);
> +        VLOG_ERR("%s: Too big 'n_rxq' (%d > %d).",
> +                 netdev_get_name(netdev), new_n_rxq, MAX_XSKQ);
> +        return EINVAL;
> +    }
> +
> +    str_xdpmode = smap_get_def(args, "xdpmode", "skb");
> +    if (!strcasecmp(str_xdpmode, "drv")) {
> +        xdpmode = XDP_ZEROCOPY;
> +    } else if (!strcasecmp(str_xdpmode, "skb")) {
> +        xdpmode = XDP_COPY;
> +    } else {
> +        VLOG_ERR("%s: Incorrect xdpmode (%s).",
> +                 netdev_get_name(netdev), str_xdpmode);
> +        ovs_mutex_unlock(&dev->mutex);
> +        return EINVAL;
> +    }
> +
> +    if (dev->requested_n_rxq != new_n_rxq
> +        || dev->requested_xdpmode != xdpmode) {
> +        dev->requested_n_rxq = new_n_rxq;
> +        dev->requested_xdpmode = xdpmode;
> +        netdev_request_reconfigure(netdev);
> +    }
> +    ovs_mutex_unlock(&dev->mutex);
> +    return 0;
> +}
> +
> +int
> +netdev_afxdp_get_config(const struct netdev *netdev, struct smap *args)
> +{
> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> +
> +    ovs_mutex_lock(&dev->mutex);
> +    smap_add_format(args, "n_rxq", "%d", netdev->n_rxq);
> +    smap_add_format(args, "xdpmode", "%s",
> +        dev->xdp_bind_flags == XDP_ZEROCOPY ? "drv" : "skb");
> +    ovs_mutex_unlock(&dev->mutex);
> +    return 0;
> +}
> +
> +static void
> +netdev_afxdp_alloc_txq(struct netdev *netdev)
> +{
> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> +    int n_txqs = netdev_n_rxq(netdev);
> +    int i;
> +
> +    dev->tx_locks = xmalloc(n_txqs * sizeof(struct ovs_spinlock));
> +
> +    for (i = 0; i < n_txqs; i++) {
> +        ovs_spinlock_init(&dev->tx_locks[i]);
> +    }
> +}
> +
> +int
> +netdev_afxdp_reconfigure(struct netdev *netdev)
> +{
> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> +    struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
> +    int err = 0;
> +
> +    ovs_mutex_lock(&dev->mutex);
> +
> +    if (netdev->n_rxq == dev->requested_n_rxq
> +        && dev->xdpmode == dev->requested_xdpmode) {
> +        goto out;
> +    }
> +
> +    xsk_destroy_all(netdev);
> +    free(dev->tx_locks);
> +
> +    netdev->n_rxq = dev->requested_n_rxq;
> +    netdev_afxdp_alloc_txq(netdev);
> +
> +    if (dev->requested_xdpmode == XDP_ZEROCOPY) {
> +        VLOG_INFO("AF_XDP device %s in DRV mode", netdev_get_name(netdev));
> +        /* From SKB mode to DRV mode */
> +        dev->xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_DRV_MODE;
> +        dev->xdp_bind_flags = XDP_ZEROCOPY;
> +        dev->xdpmode = XDP_ZEROCOPY;
> +
> +        if (setrlimit(RLIMIT_MEMLOCK, &r)) {
> +            VLOG_ERR("ERROR: setrlimit(RLIMIT_MEMLOCK): %s",
> +                      ovs_strerror(errno));
> +        }
> +    } else {
> +        VLOG_INFO("AF_XDP device %s in SKB mode", netdev_get_name(netdev));
> +        /* From DRV mode to SKB mode */
> +        dev->xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_SKB_MODE;
> +        dev->xdp_bind_flags = XDP_COPY;
> +        dev->xdpmode = XDP_COPY;
> +        /* TODO: set rlimit back to previous value
> +         * when no device is in DRV mode.
> +         */
> +    }
> +
> +    err = xsk_configure_all(netdev);
> +    if (err) {
> +        VLOG_ERR("AF_XDP device %s reconfig fails", netdev_get_name(netdev));
> +    }
> +    netdev_change_seq_changed(netdev);
> +out:
> +    ovs_mutex_unlock(&dev->mutex);
> +    return err;
> +}
> +
> +int
> +netdev_afxdp_get_numa_id(const struct netdev *netdev)
> +{
> +    /* FIXME: Get netdev's PCIe device ID, then find
> +     * its NUMA node id.
> +     */
> +    VLOG_INFO("FIXME: Device %s always use numa id 0",
> +              netdev_get_name(netdev));
> +    return 0;
> +}
> +
> +static void
> +xsk_remove_xdp_program(uint32_t ifindex, int xdpmode)
> +{
> +    uint32_t prog_id = 0;
> +    uint32_t flags;
> +
> +    /* remove_xdp_program() */
> +    if (xdpmode == XDP_COPY) {
> +        flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_SKB_MODE;
> +        VLOG_INFO("%s copy mode", __func__);
> +    } else {
> +        flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_DRV_MODE;
> +        VLOG_INFO("%s drv mode", __func__);
> +    }
> +
> +    if (bpf_get_link_xdp_id(ifindex, &prog_id, flags)) {
> +        VLOG_WARN("get xdp program id fails");
> +    }
> +    bpf_set_link_xdp_fd(ifindex, -1, XDP_FLAGS_UPDATE_IF_NOEXIST);
> +}
> +
> +void
> +signal_remove_xdp(struct netdev *netdev)
> +{
> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> +    int ifindex;
> +
> +    ifindex = linux_get_ifindex(netdev_get_name(netdev));
> +
> +    VLOG_WARN("force remove xdp program");
> +    xsk_remove_xdp_program(ifindex, dev->xdpmode);
> +}
> +
> +static struct dp_packet_afxdp *
> +dp_packet_cast_afxdp(const struct dp_packet *d)
> +{
> +    ovs_assert(d->source == DPBUF_AFXDP);
> +    return CONTAINER_OF(d, struct dp_packet_afxdp, packet);
> +}
> +
> +static inline void
> +prepare_fill_queue(struct xsk_socket_info *xsk_info)
> +{
> +    struct umem_elem *elems[BATCH_SIZE];
> +    struct xsk_umem_info *umem;
> +    unsigned int idx_fq;
> +    int nb_free;
> +    int i, ret;
> +
> +    umem = xsk_info->umem;
> +
> +    nb_free = PROD_NUM_DESCS / 2;
> +    if (xsk_prod_nb_free(&umem->fq, nb_free) < nb_free) {
> +        return;
> +    }


Why are you using 'PROD_NUM_DESCS / 2' here?
IIUC, we're keeping fill queue half-loaded. Isn't it better to
use BATCH_SIZE instead?
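
To make the difference concrete, here is a minimal toy model of the two
refill thresholds. It is not taken from the patch: 'toy_ring', 'toy_refill'
and the TOY_* constants are hypothetical stand-ins, and the values 2048 and
32 are only assumed placeholders for PROD_NUM_DESCS and BATCH_SIZE.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_RING_SIZE 2048u   /* Assumed stand-in for PROD_NUM_DESCS. */
#define TOY_BATCH     32u     /* Assumed stand-in for BATCH_SIZE. */

/* Toy producer ring: only tracks how many slots are currently in use. */
struct toy_ring {
    uint32_t used;
};

static uint32_t
toy_ring_free(const struct toy_ring *r)
{
    return TOY_RING_SIZE - r->used;
}

/* Adds one batch to the ring if at least 'threshold' slots are free.
 * 'threshold' must be >= TOY_BATCH so the ring cannot overflow.
 * Returns true if a batch was added. */
static bool
toy_refill(struct toy_ring *r, uint32_t threshold)
{
    if (toy_ring_free(r) < threshold) {
        return false;
    }
    r->used += TOY_BATCH;
    return true;
}
```

With a threshold of TOY_RING_SIZE / 2 the ring is only topped up once it has
drained to half, so it hovers around half-full. With a threshold of TOY_BATCH
it is topped up whenever one batch fits, so it stays close to full.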


> +
> +    ret = umem_elem_pop_n(&umem->mpool, BATCH_SIZE, (void **)elems);
> +    if (OVS_UNLIKELY(ret)) {
> +        return;
> +    }
> +
> +    if (!xsk_ring_prod__reserve(&umem->fq, BATCH_SIZE, &idx_fq)) {
> +        umem_elem_push_n(&umem->mpool, BATCH_SIZE, (void **)elems);
> +        COVERAGE_INC(afxdp_fq_full);
> +        return;
> +    }
> +
> +    for (i = 0; i < BATCH_SIZE; i++) {
> +        uint64_t index;
> +        struct umem_elem *elem;
> +
> +        elem = elems[i];
> +        index = (uint64_t)((char *)elem - (char *)umem->buffer);
> +        ovs_assert((index & FRAME_SHIFT_MASK) == 0);
> +        *xsk_ring_prod__fill_addr(&umem->fq, idx_fq) = index;
> +
> +        idx_fq++;
> +    }
> +    xsk_ring_prod__submit(&umem->fq, BATCH_SIZE);
> +}
> +
> +int
> +netdev_afxdp_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
> +                      int *qfill)
> +{
> +    struct netdev_rxq_linux *rx = netdev_rxq_linux_cast(rxq_);
> +    struct netdev *netdev = rx->up.netdev;
> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> +    struct xsk_socket_info *xsk_info;
> +    struct xsk_umem_info *umem;
> +    uint32_t idx_rx = 0;
> +    int qid = rxq_->queue_id;
> +    unsigned int rcvd, i;
> +
> +    xsk_info = dev->xsks[qid];
> +    if (!xsk_info || !xsk_info->xsk) {
> +        return 0;

Need to return EAGAIN.

> +    }
> +
> +    prepare_fill_queue(xsk_info);
> +
> +    umem = xsk_info->umem;
> +    rx->fd = xsk_socket__fd(xsk_info->xsk);
> +
> +    rcvd = xsk_ring_cons__peek(&xsk_info->rx, BATCH_SIZE, &idx_rx);
> +    if (!rcvd) {
> +        return 0;

Need to return EAGAIN.
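
As far as I understand the netdev provider contract, rxq_recv should report
EAGAIN when no packet is ready, and 0 only when the batch actually holds
packets. A minimal sketch of that convention, with a hypothetical 'toy_rxq'
standing in for the real queue:

```c
#include <errno.h>
#include <stddef.h>

/* Toy receive queue: only tracks how many packets are waiting. */
struct toy_rxq {
    size_t pending;
};

/* Takes up to 'max' packets from 'rxq'.  Returns 0 and stores the number
 * of packets taken in '*n_received', or returns EAGAIN if nothing was
 * available right now. */
static int
toy_rxq_recv(struct toy_rxq *rxq, size_t max, size_t *n_received)
{
    size_t n = rxq->pending < max ? rxq->pending : max;

    *n_received = n;
    if (n == 0) {
        return EAGAIN;
    }
    rxq->pending -= n;
    return 0;
}
```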

> +    }
> +
> +    /* Setup a dp_packet batch from descriptors in RX queue */
> +    for (i = 0; i < rcvd; i++) {
> +        uint64_t addr = xsk_ring_cons__rx_desc(&xsk_info->rx, idx_rx)->addr;
> +        uint32_t len = xsk_ring_cons__rx_desc(&xsk_info->rx, idx_rx)->len;
> +        char *pkt = xsk_umem__get_data(umem->buffer, addr);
> +        uint64_t index;
> +
> +        struct dp_packet_afxdp *xpacket;
> +        struct dp_packet *packet;
> +
> +        index = addr >> FRAME_SHIFT;
> +        xpacket = UMEM2XPKT(umem->xpool.array, index);
> +        packet = &xpacket->packet;
> +
> +        /* Initialize the struct dp_packet */
> +        dp_packet_use_afxdp(packet, pkt, FRAME_SIZE - FRAME_HEADROOM);
> +        dp_packet_set_size(packet, len);
> +
> +        /* Add packet into batch, increase batch->count */
> +        dp_packet_batch_add(batch, packet);
> +
> +        idx_rx++;
> +    }
> +    /* Release the RX queue */
> +    xsk_ring_cons__release(&xsk_info->rx, rcvd);
> +
> +    if (qfill) {
> +        /* TODO: return the number of remaining packets in the queue. */
> +        *qfill = 0;
> +    }
> +
> +#ifdef AFXDP_DEBUG
> +    log_xsk_stat(xsk_info);
> +#endif
> +    return 0;
> +}
> +
> +static inline int
> +kick_tx(struct xsk_socket_info *xsk_info)
> +{
> +    int ret;
> +
> +    if (!xsk_info->outstanding_tx) {
> +        return 0;
> +    }
> +
> +    /* This causes system call into kernel's xsk_sendmsg, and
> +     * xsk_generic_xmit (skb mode) or xsk_async_xmit (driver mode).
> +     */
> +    ret = sendto(xsk_socket__fd(xsk_info->xsk), NULL, 0, MSG_DONTWAIT,
> +                                NULL, 0);
> +    if (OVS_UNLIKELY(ret < 0)) {
> +        if (errno == ENXIO || errno == ENOBUFS || errno == EOPNOTSUPP) {
> +            return errno;
> +        }
> +    }
> +    /* no error, or EBUSY or EAGAIN */
> +    return 0;
> +}
> +
> +void
> +free_afxdp_buf(struct dp_packet *p)
> +{
> +    struct dp_packet_afxdp *xpacket;
> +    uintptr_t addr;
> +
> +    xpacket = dp_packet_cast_afxdp(p);
> +    if (xpacket->mpool) {
> +        void *base = dp_packet_base(p);
> +
> +        addr = (uintptr_t)base & (~FRAME_SHIFT_MASK);
> +        umem_elem_push(xpacket->mpool, (void *)addr);
> +    }
> +}
> +
> +static void
> +free_afxdp_buf_batch(struct dp_packet_batch *batch)
> +{
> +    struct dp_packet_afxdp *xpacket = NULL;
> +    struct dp_packet *packet;
> +    void *elems[BATCH_SIZE];
> +    uintptr_t addr;
> +
> +    DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
> +        xpacket = dp_packet_cast_afxdp(packet);
> +        if (xpacket->mpool) {


The check above seems useless. Also, if any packet is
skipped, we'll push a trash pointer to mpool.

If you're worried about the value, you can just assert:

            ovs_assert(xpacket->mpool);
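
A minimal sketch of the idea, using hypothetical toy types rather than the
real dp_packet structures: every slot 0..n-1 of 'elems' is written
unconditionally, so pushing 'n' elements afterwards never reads an
uninitialized entry. The frame size of 2048 is an assumption made only for
this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_FRAME_SIZE 2048u                       /* Assumed frame size. */
#define TOY_FRAME_MASK ((uintptr_t)TOY_FRAME_SIZE - 1)

/* Toy packet: 'pool' is the owning pool, 'base' points somewhere inside
 * the frame that backs the packet. */
struct toy_pkt {
    void *pool;
    void *base;
};

/* Stores the frame start address of each packet in 'elems'.  Asserts
 * instead of skipping, so 'elems' has no holes.  Returns 'n'. */
static size_t
toy_collect_frames(const struct toy_pkt *pkts, size_t n, void **elems)
{
    for (size_t i = 0; i < n; i++) {
        assert(pkts[i].pool);
        elems[i] = (void *)((uintptr_t)pkts[i].base & ~TOY_FRAME_MASK);
    }
    return n;
}
```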

> +            void *base = dp_packet_base(packet);
> +
> +            addr = (uintptr_t)base & (~FRAME_SHIFT_MASK);
> +            elems[i] = (void *)addr;
> +        }
> +    }
> +    umem_elem_push_n(xpacket->mpool, batch->count, elems);
> +    dp_packet_batch_init(batch);
> +}
> +
> +static inline bool
> +check_free_batch(struct dp_packet_batch *batch)
> +{
> +    struct umem_pool *first_mpool = NULL;
> +    struct dp_packet_afxdp *xpacket;
> +    struct dp_packet *packet;
> +
> +    DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
> +        if (packet->source != DPBUF_AFXDP) {
> +            return false;
> +        }
> +        xpacket = dp_packet_cast_afxdp(packet);
> +        if (i == 0) {
> +            first_mpool = xpacket->mpool;
> +            continue;
> +        }
> +        if (xpacket->mpool != first_mpool) {
> +            return false;
> +        }
> +    }
> +    /* All packets are DPBUF_AFXDP and from the same mpool */
> +    return true;
> +}
> +
> +static inline void
> +afxdp_complete_tx(struct xsk_socket_info *xsk_info)
> +{
> +    struct umem_elem *elems_push[BATCH_SIZE];
> +    struct xsk_umem_info *umem;
> +    uint32_t idx_cq = 0;
> +    int tx_to_free = 0;
> +    int tx_done, j;
> +
> +    umem = xsk_info->umem;
> +    tx_done = xsk_ring_cons__peek(&umem->cq, BATCH_SIZE, &idx_cq);
> +
> +    /* Recycle back to umem pool */
> +    for (j = 0; j < tx_done; j++) {
> +        struct umem_elem *elem;
> +        uint64_t *addr;
> +
> +        addr = (uint64_t *)xsk_ring_cons__comp_addr(&umem->cq, idx_cq++);
> +        if (*addr == 0) {

'addr' is an offset from 'umem->buffer'. Zero seems a valid value.
Maybe it's better to use UINT64_MAX instead?
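
To illustrate, a small self-contained sketch where the "already pushed"
marker is UINT64_MAX, so the frame at offset 0 is still recycled.
'toy_recycle' and 'TOY_ADDR_DONE' are hypothetical names used only here, not
part of the patch or of libbpf.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sentinel: can never be a valid offset into the umem. */
#define TOY_ADDR_DONE UINT64_MAX

/* Copies every not-yet-recycled offset from 'slots' into 'out' and marks
 * the slot as handled.  Returns the number of offsets copied. */
static size_t
toy_recycle(uint64_t *slots, size_t n, uint64_t *out)
{
    size_t n_out = 0;

    for (size_t i = 0; i < n; i++) {
        if (slots[i] == TOY_ADDR_DONE) {
            continue;               /* Already handled earlier. */
        }
        out[n_out++] = slots[i];    /* Offset 0 is treated as valid. */
        slots[i] = TOY_ADDR_DONE;
    }
    return n_out;
}
```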

> +            /* The elem has been pushed already */
> +            continue;
> +        }
> +        elem = ALIGNED_CAST(struct umem_elem *,
> +                            (char *)umem->buffer + *addr);
> +        elems_push[tx_to_free] = elem;
> +        *addr = 0; /* Mark as pushed */
> +        tx_to_free++;
> +    }
> +
> +    umem_elem_push_n(&umem->mpool, tx_to_free, (void **)elems_push);
> +
> +    if (tx_done > 0) {
> +        xsk_ring_cons__release(&umem->cq, tx_done);
> +        xsk_info->outstanding_tx -= tx_done;

We should probably subtract 'tx_to_free' instead and do this
outside of the 'if'.
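
Roughly what I have in mind, as a toy sketch only ('toy_tx_state' and
'toy_complete_tx' are hypothetical; 'cq_released' merely stands in for the
effect of xsk_ring_cons__release()):

```c
#include <stdint.h>

struct toy_tx_state {
    uint32_t outstanding_tx;   /* Descriptors submitted, not yet recycled. */
    uint32_t cq_released;      /* Completion entries handed back so far. */
};

/* 'tx_done' is the number of completion entries peeked, 'tx_to_free' the
 * number of those that were actually recycled (tx_to_free <= tx_done). */
static void
toy_complete_tx(struct toy_tx_state *s, uint32_t tx_done, uint32_t tx_to_free)
{
    if (tx_done > 0) {
        s->cq_released += tx_done;
    }
    /* Outside the 'if': a no-op when nothing was recycled. */
    s->outstanding_tx -= tx_to_free;
}
```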

> +    } else {
> +        COVERAGE_INC(afxdp_cq_empty);
> +    }
> +}
William Tu June 27, 2019, 10:29 p.m. UTC | #9
Hi Ilya,

Thanks for the feedback.

<snip>
> > +static struct dp_packet_afxdp *
> > +dp_packet_cast_afxdp(const struct dp_packet *d)
> > +{
> > +    ovs_assert(d->source == DPBUF_AFXDP);
> > +    return CONTAINER_OF(d, struct dp_packet_afxdp, packet);
> > +}
> > +
> > +static inline void
> > +prepare_fill_queue(struct xsk_socket_info *xsk_info)
> > +{
> > +    struct umem_elem *elems[BATCH_SIZE];
> > +    struct xsk_umem_info *umem;
> > +    unsigned int idx_fq;
> > +    int nb_free;
> > +    int i, ret;
> > +
> > +    umem = xsk_info->umem;
> > +
> > +    nb_free = PROD_NUM_DESCS / 2;
> > +    if (xsk_prod_nb_free(&umem->fq, nb_free) < nb_free) {
> > +        return;
> > +    }
>
>
> Why are you using 'PROD_NUM_DESCS / 2' here?

I don't want to refill the fq too aggressively.

> IIUC, we're keeping fill queue half-loaded. Isn't it better to
> use BATCH_SIZE instead?
>
yes, that also works.

>
> > +
> > +    ret = umem_elem_pop_n(&umem->mpool, BATCH_SIZE, (void **)elems);
> > +    if (OVS_UNLIKELY(ret)) {
> > +        return;
> > +    }
> > +
> > +    if (!xsk_ring_prod__reserve(&umem->fq, BATCH_SIZE, &idx_fq)) {
> > +        umem_elem_push_n(&umem->mpool, BATCH_SIZE, (void **)elems);
> > +        COVERAGE_INC(afxdp_fq_full);
> > +        return;
> > +    }
> > +
> > +    for (i = 0; i < BATCH_SIZE; i++) {
> > +        uint64_t index;
> > +        struct umem_elem *elem;
> > +
> > +        elem = elems[i];
> > +        index = (uint64_t)((char *)elem - (char *)umem->buffer);
> > +        ovs_assert((index & FRAME_SHIFT_MASK) == 0);
> > +        *xsk_ring_prod__fill_addr(&umem->fq, idx_fq) = index;
> > +
> > +        idx_fq++;
> > +    }
> > +    xsk_ring_prod__submit(&umem->fq, BATCH_SIZE);
> > +}
> > +
> > +int
> > +netdev_afxdp_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
> > +                      int *qfill)
> > +{
> > +    struct netdev_rxq_linux *rx = netdev_rxq_linux_cast(rxq_);
> > +    struct netdev *netdev = rx->up.netdev;
> > +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> > +    struct xsk_socket_info *xsk_info;
> > +    struct xsk_umem_info *umem;
> > +    uint32_t idx_rx = 0;
> > +    int qid = rxq_->queue_id;
> > +    unsigned int rcvd, i;
> > +
> > +    xsk_info = dev->xsks[qid];
> > +    if (!xsk_info || !xsk_info->xsk) {
> > +        return 0;
>
> Need to return EAGAIN.

OK

>
> > +    }
> > +
> > +    prepare_fill_queue(xsk_info);
> > +
> > +    umem = xsk_info->umem;
> > +    rx->fd = xsk_socket__fd(xsk_info->xsk);
> > +
> > +    rcvd = xsk_ring_cons__peek(&xsk_info->rx, BATCH_SIZE, &idx_rx);
> > +    if (!rcvd) {
> > +        return 0;
>
> Need to return EAGAIN.

OK

>
> > +    }
> > +
> > +    /* Setup a dp_packet batch from descriptors in RX queue */
> > +    for (i = 0; i < rcvd; i++) {
> > +        uint64_t addr = xsk_ring_cons__rx_desc(&xsk_info->rx, idx_rx)->addr;
> > +        uint32_t len = xsk_ring_cons__rx_desc(&xsk_info->rx, idx_rx)->len;
> > +        char *pkt = xsk_umem__get_data(umem->buffer, addr);
> > +        uint64_t index;
> > +
> > +        struct dp_packet_afxdp *xpacket;
> > +        struct dp_packet *packet;
> > +
> > +        index = addr >> FRAME_SHIFT;
> > +        xpacket = UMEM2XPKT(umem->xpool.array, index);
> > +        packet = &xpacket->packet;
> > +
> > +        /* Initialize the struct dp_packet */
> > +        dp_packet_use_afxdp(packet, pkt, FRAME_SIZE - FRAME_HEADROOM);
> > +        dp_packet_set_size(packet, len);
> > +
> > +        /* Add packet into batch, increase batch->count */
> > +        dp_packet_batch_add(batch, packet);
> > +
> > +        idx_rx++;
> > +    }
> > +    /* Release the RX queue */
> > +    xsk_ring_cons__release(&xsk_info->rx, rcvd);
> > +
> > +    if (qfill) {
> > +        /* TODO: return the number of remaining packets in the queue. */
> > +        *qfill = 0;
> > +    }
> > +
> > +#ifdef AFXDP_DEBUG
> > +    log_xsk_stat(xsk_info);
> > +#endif
> > +    return 0;
> > +}
> > +
> > +static inline int
> > +kick_tx(struct xsk_socket_info *xsk_info)
> > +{
> > +    int ret;
> > +
> > +    if (!xsk_info->outstanding_tx) {
> > +        return 0;
> > +    }
> > +
> > +    /* This causes system call into kernel's xsk_sendmsg, and
> > +     * xsk_generic_xmit (skb mode) or xsk_async_xmit (driver mode).
> > +     */
> > +    ret = sendto(xsk_socket__fd(xsk_info->xsk), NULL, 0, MSG_DONTWAIT,
> > +                                NULL, 0);
> > +    if (OVS_UNLIKELY(ret < 0)) {
> > +        if (errno == ENXIO || errno == ENOBUFS || errno == EOPNOTSUPP) {
> > +            return errno;
> > +        }
> > +    }
> > +    /* no error, or EBUSY or EAGAIN */
> > +    return 0;
> > +}
> > +
> > +void
> > +free_afxdp_buf(struct dp_packet *p)
> > +{
> > +    struct dp_packet_afxdp *xpacket;
> > +    uintptr_t addr;
> > +
> > +    xpacket = dp_packet_cast_afxdp(p);
> > +    if (xpacket->mpool) {
> > +        void *base = dp_packet_base(p);
> > +
> > +        addr = (uintptr_t)base & (~FRAME_SHIFT_MASK);
> > +        umem_elem_push(xpacket->mpool, (void *)addr);
> > +    }
> > +}
> > +
> > +static void
> > +free_afxdp_buf_batch(struct dp_packet_batch *batch)
> > +{
> > +    struct dp_packet_afxdp *xpacket = NULL;
> > +    struct dp_packet *packet;
> > +    void *elems[BATCH_SIZE];
> > +    uintptr_t addr;
> > +
> > +    DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
> > +        xpacket = dp_packet_cast_afxdp(packet);
> > +        if (xpacket->mpool) {
>
>
> The check above seems useless. Also, if any packet is
> skipped, we'll push a trash pointer to mpool.
>
Thanks, will skip it.

> If you're worrying about the value, you may just assert:
>
>             ovs_assert(xpacket->mpool);
>
> > +            void *base = dp_packet_base(packet);
> > +
> > +            addr = (uintptr_t)base & (~FRAME_SHIFT_MASK);
> > +            elems[i] = (void *)addr;
> > +        }
> > +    }
> > +    umem_elem_push_n(xpacket->mpool, batch->count, elems);
> > +    dp_packet_batch_init(batch);
> > +}
> > +
> > +static inline bool
> > +check_free_batch(struct dp_packet_batch *batch)
> > +{
> > +    struct umem_pool *first_mpool = NULL;
> > +    struct dp_packet_afxdp *xpacket;
> > +    struct dp_packet *packet;
> > +
> > +    DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
> > +        if (packet->source != DPBUF_AFXDP) {
> > +            return false;
> > +        }
> > +        xpacket = dp_packet_cast_afxdp(packet);
> > +        if (i == 0) {
> > +            first_mpool = xpacket->mpool;
> > +            continue;
> > +        }
> > +        if (xpacket->mpool != first_mpool) {
> > +            return false;
> > +        }
> > +    }
> > +    /* All packets are DPBUF_AFXDP and from the same mpool */
> > +    return true;
> > +}
> > +
> > +static inline void
> > +afxdp_complete_tx(struct xsk_socket_info *xsk_info)
> > +{
> > +    struct umem_elem *elems_push[BATCH_SIZE];
> > +    struct xsk_umem_info *umem;
> > +    uint32_t idx_cq = 0;
> > +    int tx_to_free = 0;
> > +    int tx_done, j;
> > +
> > +    umem = xsk_info->umem;
> > +    tx_done = xsk_ring_cons__peek(&umem->cq, BATCH_SIZE, &idx_cq);
> > +
> > +    /* Recycle back to umem pool */
> > +    for (j = 0; j < tx_done; j++) {
> > +        struct umem_elem *elem;
> > +        uint64_t *addr;
> > +
> > +        addr = (uint64_t *)xsk_ring_cons__comp_addr(&umem->cq, idx_cq++);
> > +        if (*addr == 0) {
>
> 'addr' is an offset from 'umem->buffer'. Zero seems a valid value.
> Maybe it's better to use UINT64_MAX instead?

Thanks a lot! I shouldn't use zero, will switch to use UINT64_MAX.

>
> > +            /* The elem has been pushed already */
> > +            continue;
> > +        }
> > +        elem = ALIGNED_CAST(struct umem_elem *,
> > +                            (char *)umem->buffer + *addr);
> > +        elems_push[tx_to_free] = elem;
> > +        *addr = 0; /* Mark as pushed */
> > +        tx_to_free++;
> > +    }
> > +
> > +    umem_elem_push_n(&umem->mpool, tx_to_free, (void **)elems_push);
> > +
> > +    if (tx_done > 0) {
> > +        xsk_ring_cons__release(&umem->cq, tx_done);
> > +        xsk_info->outstanding_tx -= tx_done;
>
> We should probably subtract 'tx_to_free' instead and do this
> outside of the 'if'.
>
OK

--William
Ilya Maximets June 28, 2019, 12:58 p.m. UTC | #10
A few more bits.

On 19.06.2019 22:51, William Tu wrote:
> The patch introduces experimental AF_XDP support for OVS netdev.
> AF_XDP, the Address Family of the eXpress Data Path, is a new Linux socket
> type built upon the eBPF and XDP technology.  It aims to have comparable
> performance to DPDK but cooperate better with the existing kernel
> networking stack.  An AF_XDP socket receives and sends packets from an
> eBPF/XDP program attached to the netdev, bypassing a couple of the Linux
> kernel's subsystems.  As a result, an AF_XDP socket shows much better
> performance than AF_PACKET.  For more details about AF_XDP, please see the
> Linux kernel's Documentation/networking/af_xdp.rst.  Note that by default,
> this feature is not compiled in.
> 
> Signed-off-by: William Tu <u9012063@gmail.com>
> ---

<snip>

> +int
> +netdev_afxdp_batch_send(struct netdev *netdev, int qid,
> +                        struct dp_packet_batch *batch,
> +                        bool concurrent_txq)
> +{
> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> +    struct xsk_socket_info *xsk_info = dev->xsks[qid];

You're remapping 'qid' below, but using old 'xsk_info'.
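
In other words, the socket lookup should happen after the remap. A toy
sketch of the ordering, with hypothetical 'toy_dev' / 'toy_sock' types in
place of the real netdev structures:

```c
#include <stdbool.h>

struct toy_sock {
    int id;
};

struct toy_dev {
    struct toy_sock **socks;   /* One socket per queue. */
    int n_txq;
};

/* Returns the socket that matches the queue id actually used for the
 * send, i.e. the id after the optional remap. */
static struct toy_sock *
toy_pick_sock(const struct toy_dev *dev, int qid, bool concurrent_txq)
{
    if (concurrent_txq) {
        qid = qid % dev->n_txq;
    }
    return dev->socks[qid];    /* Lookup after the remap, not before. */
}
```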

> +    struct umem_elem *elems_pop[BATCH_SIZE];
> +    struct xsk_umem_info *umem;
> +    struct dp_packet *packet;
> +    bool free_batch = true;

This must be 'false' by default.
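
The reason, as I read the code: the early 'goto out' is taken before
check_free_batch() runs, so with a 'true' default the batch would go down
the umem-pool free path without having been verified. A toy sketch of the
control flow (all names here are hypothetical):

```c
#include <stdbool.h>

enum toy_free_path {
    TOY_FREE_GENERIC,   /* Stands in for dp_packet_delete_batch(). */
    TOY_FREE_POOL       /* Stands in for free_afxdp_buf_batch(). */
};

/* Returns which free path the batch ends up on. */
static enum toy_free_path
toy_send(bool sock_ready, bool batch_is_pool_backed)
{
    bool free_batch = false;   /* Safe default: batch not verified yet. */

    if (!sock_ready) {
        goto out;              /* Early exit: the check below never runs. */
    }
    free_batch = batch_is_pool_backed;

out:
    return free_batch ? TOY_FREE_POOL : TOY_FREE_GENERIC;
}
```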

> +    uint32_t idx = 0;
> +    int error = 0;
> +    int ret;
> +
> +    if (OVS_UNLIKELY(concurrent_txq)) {
> +        qid = qid % dev->up.n_txq;
> +        ovs_spin_lock(&dev->tx_locks[qid]);
> +    }
> +
> +    if (!xsk_info || !xsk_info->xsk) {
> +        goto out;
> +    }
> +
> +    afxdp_complete_tx(xsk_info);
> +
> +    free_batch = check_free_batch(batch);
> +
> +    umem = xsk_info->umem;
> +    ret = umem_elem_pop_n(&umem->mpool, batch->count, (void **)elems_pop);
> +    if (OVS_UNLIKELY(ret)) {
> +        xsk_info->tx_dropped += batch->count;
> +        error = ENOMEM;
> +        goto out;
> +    }
> +
> +    /* Make sure we have enough TX descs */
> +    ret = xsk_ring_prod__reserve(&xsk_info->tx, batch->count, &idx);
> +    if (OVS_UNLIKELY(ret == 0)) {
> +        umem_elem_push_n(&umem->mpool, batch->count, (void **)elems_pop);
> +        xsk_info->tx_dropped += batch->count;
> +        error = ENOMEM;
> +        goto out;
> +    }
> +
> +    DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
> +        struct umem_elem *elem;
> +        uint64_t index;
> +
> +        elem = elems_pop[i];
> +        /* Copy the packet to the umem we just pop from umem pool.
> +         * TODO: avoid this copy if the packet and the pop umem
> +         * are located in the same umem.
> +         */
> +        memcpy(elem, dp_packet_data(packet), dp_packet_size(packet));
> +
> +        index = (uint64_t)((char *)elem - (char *)umem->buffer);
> +        xsk_ring_prod__tx_desc(&xsk_info->tx, idx + i)->addr = index;
> +        xsk_ring_prod__tx_desc(&xsk_info->tx, idx + i)->len
> +            = dp_packet_size(packet);
> +    }
> +    xsk_ring_prod__submit(&xsk_info->tx, batch->count);
> +    xsk_info->outstanding_tx += batch->count;
> +
> +    ret = kick_tx(xsk_info);
> +    if (OVS_UNLIKELY(ret)) {
> +        VLOG_WARN_RL(&rl, "error sending AF_XDP packet: %s",
> +                     ovs_strerror(ret));
> +    }
> +
> +out:
> +    if (free_batch) {
> +        free_afxdp_buf_batch(batch);
> +    } else {
> +        dp_packet_delete_batch(batch, true);
> +    }
> +
> +    if (OVS_UNLIKELY(concurrent_txq)) {
> +        ovs_spin_unlock(&dev->tx_locks[qid]);
> +    }
> +    return error;
> +}
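
The TX path quoted above takes umem frames from the pool first, then TX descriptors, and hands the frames back if the ring has no room. A minimal standalone sketch of that all-or-nothing pattern follows. The `toy_*` names, `POOL_CAP` and `FRAME_SIZE` are hypothetical stand-ins, not the OVS or libbpf API, and the real `umem_elem_pop_n()` may differ in detail.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define POOL_CAP 8        /* Hypothetical number of frames in the pool. */
#define FRAME_SIZE 2048   /* Hypothetical umem frame size in bytes. */

/* Toy LIFO pool of free umem frames (stand-in for umem->mpool). */
struct toy_pool {
    void *slots[POOL_CAP];
    size_t count;
};

/* Pops exactly 'n' elements or none.  Returns 0 on success, -1 otherwise. */
static int
toy_pool_pop_n(struct toy_pool *p, size_t n, void **out)
{
    if (p->count < n) {
        return -1;
    }
    for (size_t i = 0; i < n; i++) {
        out[i] = p->slots[--p->count];
    }
    return 0;
}

/* Returns 'n' elements to the pool (used to roll back a failed send). */
static void
toy_pool_push_n(struct toy_pool *p, size_t n, void **elems)
{
    assert(p->count + n <= POOL_CAP);
    for (size_t i = 0; i < n; i++) {
        p->slots[p->count++] = elems[i];
    }
}

/* Descriptor address expressed as an offset from the start of the umem. */
static uint64_t
toy_umem_offset(const void *umem_base, const void *elem)
{
    return (uint64_t) ((const char *) elem - (const char *) umem_base);
}

/* Reserves 'n' frames, then checks for 'n' free ring slots.  If the ring is
 * short, the frames go back to the pool.  Returns 0 on success, -1 otherwise. */
static int
toy_reserve_tx(struct toy_pool *p, size_t ring_free, size_t n, void **elems)
{
    if (toy_pool_pop_n(p, n, elems)) {
        return -1;                      /* Pool exhausted: nothing to undo. */
    }
    if (ring_free < n) {
        toy_pool_push_n(p, n, elems);   /* Ring full: give the frames back. */
        return -1;
    }
    return 0;
}
```

With 8 frames in the pool, a request for 4 frames and 4 free ring slots succeeds and leaves 4 frames. A later request that fails on ring space leaves the pool count unchanged, which is the property the rollback is there to guarantee.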

<snip>

> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> index bf4b6f8dc621..1f020e1c3825 100644
> --- a/vswitchd/vswitch.xml
> +++ b/vswitchd/vswitch.xml
> @@ -3106,6 +3106,36 @@ ovs-vsctl add-port br0 p0 -- set Interface p0 type=patch options:peer=p1 \
>          </p>
>        </column>
>  
> +      <column name="other_config" key="xdpmode"
> +              type='{"type": "string",
> +                     "enum": ["set", ["skb", "drv"]]}'>
> +        <p>
> +          Specifies the operational mode of the XDP program.
> +          If "drv", the XDP program is loaded into the device driver with
> +          zero-copy RX and TX enabled. This mode requires device driver with
> +          AF_XDP support and has the best performance.
> +          If "skb", the XDP program is using generic XDP mode in kernel with
> +          extra data copying between userspace and kernel. No device driver
> +          support is needed. Note that this is afxdp netdev type only.
> +          Defaults to "skb" mode.
> +        </p>
> +      </column>
> +
> +      <column name="other_config" key="xdpmode"
> +              type='{"type": "string",
> +                     "enum": ["set", ["skb", "drv"]]}'>
> +        <p>
> +          Specifies the operational mode of the XDP program.
> +          If "drv", the XDP program is loaded into the device driver with
> +          zero-copy RX and TX enabled. This mode requires device driver with
> +          AF_XDP support and has the best performance.
> +          If "skb", the XDP program is using generic XDP mode in kernel with
> +          extra data copying between userspace and kernel. No device driver
> +          support is needed. Note that this is afxdp netdev type only.
> +          Defaults to "skb" mode.
> +        </p>
> +      </column>
> +

Duplicated docs.


One more thing I noticed is the same issue you had with the completion queue, but
with the rx queue. When I try to send traffic from 2 threads to the same port,
I start receiving the same pointers from the rx ring. Not only the same ring entries:
there were cases where two identical pointers were stored sequentially in the rx ring.
I'm more and more convinced that it's a kernel/libbpf bug. The last thing left
to check is the pointers inside the fill queue. All other parts of OVS seem to
work correctly. I'll send more information about the testcase later, after re-checking
with the most recent bpf-next.

Best regards, Ilya Maximets.
William Tu June 28, 2019, 4:37 p.m. UTC | #11
>
> > +int
> > +netdev_afxdp_batch_send(struct netdev *netdev, int qid,
> > +                        struct dp_packet_batch *batch,
> > +                        bool concurrent_txq)
> > +{
> > +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> > +    struct xsk_socket_info *xsk_info = dev->xsks[qid];
>
> You're remapping 'qid' below, but using old 'xsk_info'.
>
> > +    struct umem_elem *elems_pop[BATCH_SIZE];
> > +    struct xsk_umem_info *umem;
> > +    struct dp_packet *packet;
> > +    bool free_batch = true;
>
> This must be 'false' by default.
>
> > +    uint32_t idx = 0;
> > +    int error = 0;
> > +    int ret;
> > +
> > +    if (OVS_UNLIKELY(concurrent_txq)) {
> > +        qid = qid % dev->up.n_txq;
> > +        ovs_spin_lock(&dev->tx_locks[qid]);
> > +    }
> > +
> > +    if (!xsk_info || !xsk_info->xsk) {
> > +        goto out;
> > +    }
> > +
> > +    afxdp_complete_tx(xsk_info);
> > +
> > +    free_batch = check_free_batch(batch);
> > +
> > +    umem = xsk_info->umem;
> > +    ret = umem_elem_pop_n(&umem->mpool, batch->count, (void **)elems_pop);
> > +    if (OVS_UNLIKELY(ret)) {
> > +        xsk_info->tx_dropped += batch->count;
> > +        error = ENOMEM;
> > +        goto out;
> > +    }
> > +
> > +    /* Make sure we have enough TX descs */
> > +    ret = xsk_ring_prod__reserve(&xsk_info->tx, batch->count, &idx);
> > +    if (OVS_UNLIKELY(ret == 0)) {
> > +        umem_elem_push_n(&umem->mpool, batch->count, (void **)elems_pop);
> > +        xsk_info->tx_dropped += batch->count;
> > +        error = ENOMEM;
> > +        goto out;
> > +    }
> > +
> > +    DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
> > +        struct umem_elem *elem;
> > +        uint64_t index;
> > +
> > +        elem = elems_pop[i];
> > +        /* Copy the packet to the umem we just pop from umem pool.
> > +         * TODO: avoid this copy if the packet and the pop umem
> > +         * are located in the same umem.
> > +         */
> > +        memcpy(elem, dp_packet_data(packet), dp_packet_size(packet));
> > +
> > +        index = (uint64_t)((char *)elem - (char *)umem->buffer);
> > +        xsk_ring_prod__tx_desc(&xsk_info->tx, idx + i)->addr = index;
> > +        xsk_ring_prod__tx_desc(&xsk_info->tx, idx + i)->len
> > +            = dp_packet_size(packet);
> > +    }
> > +    xsk_ring_prod__submit(&xsk_info->tx, batch->count);
> > +    xsk_info->outstanding_tx += batch->count;
> > +
> > +    ret = kick_tx(xsk_info);
> > +    if (OVS_UNLIKELY(ret)) {
> > +        VLOG_WARN_RL(&rl, "error sending AF_XDP packet: %s",
> > +                     ovs_strerror(ret));
> > +    }
> > +
> > +out:
> > +    if (free_batch) {
> > +        free_afxdp_buf_batch(batch);
> > +    } else {
> > +        dp_packet_delete_batch(batch, true);
> > +    }
> > +
> > +    if (OVS_UNLIKELY(concurrent_txq)) {
> > +        ovs_spin_unlock(&dev->tx_locks[qid]);
> > +    }
> > +    return error;
> > +}
>
> <snip>
>
> > diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> > index bf4b6f8dc621..1f020e1c3825 100644
> > --- a/vswitchd/vswitch.xml
> > +++ b/vswitchd/vswitch.xml
> > @@ -3106,6 +3106,36 @@ ovs-vsctl add-port br0 p0 -- set Interface p0 type=patch options:peer=p1 \
> >          </p>
> >        </column>
> >
> > +      <column name="other_config" key="xdpmode"
> > +              type='{"type": "string",
> > +                     "enum": ["set", ["skb", "drv"]]}'>
> > +        <p>
> > +          Specifies the operational mode of the XDP program.
> > +          If "drv", the XDP program is loaded into the device driver with
> > +          zero-copy RX and TX enabled. This mode requires device driver with
> > +          AF_XDP support and has the best performance.
> > +          If "skb", the XDP program is using generic XDP mode in kernel with
> > +          extra data copying between userspace and kernel. No device driver
> > +          support is needed. Note that this is afxdp netdev type only.
> > +          Defaults to "skb" mode.
> > +        </p>
> > +      </column>
> > +
> > +      <column name="other_config" key="xdpmode"
> > +              type='{"type": "string",
> > +                     "enum": ["set", ["skb", "drv"]]}'>
> > +        <p>
> > +          Specifies the operational mode of the XDP program.
> > +          If "drv", the XDP program is loaded into the device driver with
> > +          zero-copy RX and TX enabled. This mode requires device driver with
> > +          AF_XDP support and has the best performance.
> > +          If "skb", the XDP program is using generic XDP mode in kernel with
> > +          extra data copying between userspace and kernel. No device driver
> > +          support is needed. Note that this is afxdp netdev type only.
> > +          Defaults to "skb" mode.
> > +        </p>
> > +      </column>
> > +
Thanks! I will fix the above 3 places.
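
The first review comment quoted above concerns ordering: `xsk_info` is fetched with the caller's `qid` before `qid` is remapped for the concurrent case. A minimal sketch of the corrected order follows, assuming simplified stand-in structures. `toy_dev`, `toy_xsk` and both functions are hypothetical and are not the code that ended up in later revisions of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the OVS structures named in the review. */
struct toy_xsk {
    int queue_id;
};

struct toy_dev {
    int n_txq;
    struct toy_xsk **xsks;   /* One socket per TX queue. */
};

/* Remaps 'qid' into [0, n_txq) when several threads may share TX queues. */
static int
toy_remap_qid(const struct toy_dev *dev, int qid, bool concurrent_txq)
{
    return concurrent_txq ? qid % dev->n_txq : qid;
}

/* Looks up the socket only after 'qid' has its final value, so the lock
 * index and the socket index always refer to the same queue. */
static struct toy_xsk *
toy_lookup_xsk(const struct toy_dev *dev, int qid, bool concurrent_txq)
{
    qid = toy_remap_qid(dev, qid, concurrent_txq);
    assert(qid >= 0 && qid < dev->n_txq);
    return dev->xsks[qid];
}
```

With two TX queues, a caller passing `qid` 4 in the concurrent case ends up on queue 0 for both the lock and the socket, instead of indexing the socket array out of range.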

>
> Duplicated docs.
>
>
> One more thing I noticed is the same issue as you had with completion queue, but
> with rx queue. When I'm trying to send traffic from 2 threads to the same port,

Are the 2 threads sending traffic using afxdp tx?

> I'm starting receiving same pointers from rx ring. Not only the same ring entries,
> but there was cases where two identical pointers was stored sequentially in rx ring.

I use a similar approach to the one used for the completion queue (assigning
UINT64_MAX to rx ring entries in netdev_afxdp_rxq_recv), but I do not see any
identical pointers.
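
The check described here can be sketched roughly as follows, assuming a plain array-backed ring. `toy_ring` and `toy_ring_consume` are hypothetical names, and this is not the code in netdev_afxdp_rxq_recv. Each consumed slot is overwritten with UINT64_MAX, so reading a slot that the producer has not rewritten shows up as the poison value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE 4u   /* Hypothetical ring size; must be a power of two. */

struct toy_ring {
    uint64_t addr[RING_SIZE];
    uint32_t cons;             /* Free-running consumer index. */
};

/* Reads the next address and poisons its slot.  Returns false, leaving
 * '*addr' untouched, if the slot still holds the poison value, i.e. the
 * producer never refilled it since the last read. */
static bool
toy_ring_consume(struct toy_ring *r, uint64_t *addr)
{
    uint32_t slot = r->cons++ & (RING_SIZE - 1);

    assert(addr != NULL);
    if (r->addr[slot] == UINT64_MAX) {
        return false;
    }
    *addr = r->addr[slot];
    r->addr[slot] = UINT64_MAX;
    return true;
}
```

This only catches a ring entry being handed out twice. It does not by itself catch two different entries that carry the same address, which would need a separate per-address check.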

> I'm more and more thinking that it's a kernel/libbpf bug. The last bit that left
> for checking is the pointers inside the fill queue. All other parts in OVS seems to
> work correctly. I'll send more information about the testcase later after re-checking
> with the most recent bpf-next.
>

Looking forward to your investigation! Thanks a lot.
William
Ilya Maximets July 2, 2019, 3:10 p.m. UTC | #12
On 28.06.2019 19:37, William Tu wrote:
>>
>>
>> One more thing I noticed is the same issue as you had with completion queue, but
>> with rx queue. When I'm trying to send traffic from 2 threads to the same port,
> 
> Is the 2 threads send traffic using afxdp tx?

Yes.

> 
>> I'm starting receiving same pointers from rx ring. Not only the same ring entries,
>> but there was cases where two identical pointers was stored sequentially in rx ring.
> 
> I use similar way as used in completion queue (assign UINT64_MAX to rx ring
> at netdev_afxdp_rxq_recv) but do not see any identical pointers.
> 
>> I'm more and more thinking that it's a kernel/libbpf bug. The last bit that left
>> for checking is the pointers inside the fill queue. All other parts in OVS seems to
>> work correctly. I'll send more information about the testcase later after re-checking
>> with the most recent bpf-next.
>>
> 
> Look forward to your investigation! Thanks a lot.

It was a kernel bug: the generic receive path doesn't take any locks,
but generic receive can be triggered from different cores at the same
time, breaking the rx and fill queues. I tried to run 2 traffic flows over
the veth pair, one side of which was opened by netdev-afxdp in OVS, and
OVS constantly crashed because two kernel threads tried to allocate the same
addresses from the fill queue and pushed them to the rx queue. That is the root
cause of the duplicated addresses in the RX queue. Data in these descriptors
was most probably corrupted too.
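
The failure mode can be reproduced in miniature without threads by replaying the bad interleaving by hand. The sketch below is a deterministic simulation, not kernel code. The two-step peek/discard split and all `toy_*` names are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FILL_SIZE 4u   /* Hypothetical ring size; must be a power of two. */

struct toy_fill_queue {
    uint64_t addr[FILL_SIZE];
    uint32_t cons;             /* Free-running consumer index. */
};

/* Step 1 of a dequeue: look at the next address without consuming it. */
static uint64_t
toy_fq_peek(const struct toy_fill_queue *fq)
{
    return fq->addr[fq->cons & (FILL_SIZE - 1)];
}

/* Step 2 of a dequeue: advance the consumer index. */
static void
toy_fq_discard(struct toy_fill_queue *fq)
{
    fq->cons++;
}

/* Two contexts interleave as peek, peek, discard, discard, which nothing
 * prevents without a lock.  Returns 1 if both got the same address. */
static int
toy_unlocked_pair(struct toy_fill_queue *fq, uint64_t *a, uint64_t *b)
{
    assert(a != NULL && b != NULL);
    *a = toy_fq_peek(fq);    /* Context A peeks. */
    *b = toy_fq_peek(fq);    /* Context B peeks before A discards. */
    toy_fq_discard(fq);
    toy_fq_discard(fq);
    return *a == *b;
}

/* The same two dequeues with each peek/discard pair kept together, which is
 * what a lock would enforce.  Returns 1 if both got the same address. */
static int
toy_serialized_pair(struct toy_fill_queue *fq, uint64_t *a, uint64_t *b)
{
    assert(a != NULL && b != NULL);
    *a = toy_fq_peek(fq);
    toy_fq_discard(fq);
    *b = toy_fq_peek(fq);
    toy_fq_discard(fq);
    return *a == *b;
}
```

In the unlocked replay both contexts receive the first address and the second address is skipped entirely, so one frame is used twice and another is lost.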

I've sent a patch for this issue:
    https://lore.kernel.org/bpf/20190702143634.19688-1-i.maximets@samsung.com/

I'm still having some trouble with this scenario. Sometimes the traffic
simply stops flowing. But this seems to be a different issue. Most likely, one
more kernel issue...
However, OVS doesn't crash for me anymore, and that is good news.


-------------------------
Full testcase description
-------------------------
ip netns add at_ns0
ip netns add at_ns1

ip link add p0 type veth peer name patch-p0
ethtool -K p0 tx off rxvlan off txvlan off
  
ip link set p0 netns at_ns0  
ip link set dev patch-p0 up
ip link set dev patch-p0 promisc on
  
ip netns exec at_ns0 ip addr add "10.1.1.1/24" dev p0
ip netns exec at_ns0 ip link set dev p0 up

ip link add p1 type veth peer name patch-p1
ethtool -K p1 tx off rxvlan off txvlan off

ip link set p1 netns at_ns1
ip link set dev patch-p1 up
ip link set dev patch-p1 promisc on

ip netns exec at_ns1 ip addr add "10.1.1.2/24" dev p1
ip netns exec at_ns1 ip link set dev p1 up

<start OVS and add patch-p0 and patch-p1 as afxdp ports>

# up the internal port of ovs bridge
ip link set dev br0 up
ip addr add dev br0 10.1.1.13/24


[shell#1] ip netns exec at_ns1 iperf3 -s
[shell#2] ip netns exec at_ns1 iperf3 -s -p 5008
[shell#3] ip netns exec at_ns0 iperf3 -c 10.1.1.2 -t 3600

[shell#4] iperf3 -c 10.1.1.2 -t 3600 -p 5008 # Works via internal port.

<Observe OVS crash>

-----
For this testcase to work you need the 'skb_unclone' patch applied to the kernel,
otherwise TCP traffic will not flow.


Best regards, Ilya Maximets.

Patch

diff --git a/Documentation/automake.mk b/Documentation/automake.mk
index 082438e09a33..11cc59efc881 100644
--- a/Documentation/automake.mk
+++ b/Documentation/automake.mk
@@ -10,6 +10,7 @@  DOC_SOURCE = \
 	Documentation/intro/why-ovs.rst \
 	Documentation/intro/install/index.rst \
 	Documentation/intro/install/bash-completion.rst \
+	Documentation/intro/install/afxdp.rst \
 	Documentation/intro/install/debian.rst \
 	Documentation/intro/install/documentation.rst \
 	Documentation/intro/install/distributions.rst \
diff --git a/Documentation/index.rst b/Documentation/index.rst
index 46261235c732..aa9e7c49f179 100644
--- a/Documentation/index.rst
+++ b/Documentation/index.rst
@@ -59,6 +59,7 @@  vSwitch? Start here.
   :doc:`intro/install/windows` |
   :doc:`intro/install/xenserver` |
   :doc:`intro/install/dpdk` |
+  :doc:`intro/install/afxdp` |
   :doc:`Installation FAQs <faq/releases>`
 
 - **Tutorials:** :doc:`tutorials/faucet` |
diff --git a/Documentation/intro/install/afxdp.rst b/Documentation/intro/install/afxdp.rst
new file mode 100644
index 000000000000..291df8d45020
--- /dev/null
+++ b/Documentation/intro/install/afxdp.rst
@@ -0,0 +1,425 @@ 
+..
+      Licensed under the Apache License, Version 2.0 (the "License"); you may
+      not use this file except in compliance with the License. You may obtain
+      a copy of the License at
+
+          http://www.apache.org/licenses/LICENSE-2.0
+
+      Unless required by applicable law or agreed to in writing, software
+      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+      License for the specific language governing permissions and limitations
+      under the License.
+
+      Convention for heading levels in Open vSwitch documentation:
+
+      =======  Heading 0 (reserved for the title in a document)
+      -------  Heading 1
+      ~~~~~~~  Heading 2
+      +++++++  Heading 3
+      '''''''  Heading 4
+
+      Avoid deeper levels because they do not render well.
+
+
+========================
+Open vSwitch with AF_XDP
+========================
+
+This document describes how to build and install Open vSwitch using
+the AF_XDP netdev.
+
+.. warning::
+  The AF_XDP support of Open vSwitch is considered 'experimental',
+  and it is not compiled in by default.
+
+
+Introduction
+------------
+AF_XDP, the Address Family of the eXpress Data Path, is a new Linux socket type
+built upon the eBPF and XDP technology.  It aims to have comparable
+performance to DPDK but cooperate better with the existing kernel networking
+stack.  An AF_XDP socket receives and sends packets from an eBPF/XDP program
+attached to the netdev, bypassing a couple of the Linux kernel's subsystems.
+As a result, an AF_XDP socket shows much better performance than AF_PACKET.
+For more details about AF_XDP, please see the Linux kernel's
+Documentation/networking/af_xdp.rst.
+
+
+AF_XDP Netdev
+-------------
+OVS has a couple of netdev types, e.g., system, tap, or
+dpdk.  The AF_XDP feature adds a new netdev type called
+"afxdp", and implements its configuration, packet reception,
+and transmit functions.  Since the AF_XDP socket, called xsk,
+operates in userspace, once ovs-vswitchd receives packets
+from xsk, the afxdp netdev re-uses the existing userspace
+dpif-netdev datapath.  As a result, most of the packet processing
+happens in userspace instead of the Linux kernel.
+
+::
+
+              |   +-------------------+
+              |   |    ovs-vswitchd   |<-->ovsdb-server
+              |   +-------------------+
+              |   |      ofproto      |<-->OpenFlow controllers
+              |   +--------+-+--------+
+              |   | netdev | |ofproto-|
+    userspace |   +--------+ |  dpif  |
+              |   | afxdp  | +--------+
+              |   | netdev | |  dpif  |
+              |   +---||---+ +--------+
+              |       ||     |  dpif- |
+              |       ||     | netdev |
+              |_      ||     +--------+
+                      ||
+               _  +---||-----+--------+
+              |   | AF_XDP prog +     |
+       kernel |   |   xsk_map         |
+              |_  +--------||---------+
+                           ||
+                        physical
+                           NIC
+
+
+Build requirements
+------------------
+
+In addition to the requirements described in :doc:`general`, building Open
+vSwitch with AF_XDP will require the following:
+
+- libbpf from kernel source tree (kernel 5.0.0 or later)
+
+- Linux kernel XDP support, with the following options (required)
+
+  * CONFIG_BPF=y
+
+  * CONFIG_BPF_SYSCALL=y
+
+  * CONFIG_XDP_SOCKETS=y
+
+
+- The following optional Kconfig options are also recommended, but not
+  required:
+
+  * CONFIG_BPF_JIT=y (Performance)
+
+  * CONFIG_HAVE_BPF_JIT=y (Performance)
+
+  * CONFIG_XDP_SOCKETS_DIAG=y (Debugging)
+
+- Once your AF_XDP-enabled kernel is ready, if possible, run
+  **./xdpsock -r -N -z -i <your device>** under linux/samples/bpf.
+  This is an OVS-independent benchmark tool for AF_XDP.
+  It makes sure your basic kernel requirements are met for AF_XDP.
+
+
+Installing
+----------
+For OVS to use AF_XDP netdev, it has to be configured with LIBBPF support.
+First, clone a recent version of Linux bpf-next tree::
+
+  git clone git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git
+
+Second, go into the Linux source directory and build libbpf in the tools
+directory::
+
+  cd bpf-next/
+  cd tools/lib/bpf/
+  make && make install
+  make install_headers
+
+.. note::
+   Make sure xsk.h and bpf.h are installed in the system's include path,
+   e.g. /usr/local/include/bpf/ or /usr/include/bpf/
+
+Make sure the libbpf.so is installed correctly::
+
+  ldconfig
+  ldconfig -p | grep libbpf
+
+Third, ensure the standard OVS requirements are installed and
+bootstrap/configure the package::
+
+  ./boot.sh && ./configure --enable-afxdp
+
+Finally, build and install OVS::
+
+  make && make install
+
+To kick-start end-to-end autotesting::
+
+  uname -a # make sure you have a 5.0+ kernel
+  make check-afxdp TESTSUITEFLAGS='1'
+
+If a test case fails, check the log at::
+
+  cat tests/system-afxdp-testsuite.dir/001/system-afxdp-testsuite.log
+
+
+Setup AF_XDP netdev
+-------------------
+Before running OVS with AF_XDP, make sure libbpf and libelf are
+set up correctly::
+
+  ldd vswitchd/ovs-vswitchd
+
+Open vSwitch should be started using the userspace datapath as described
+in :doc:`general`::
+
+  ovs-vswitchd ...
+  ovs-vsctl -- add-br br0 -- set Bridge br0 datapath_type=netdev
+
+Make sure your device driver supports AF_XDP.  To use 1 PMD (on core 4)
+on a device with 1 queue (queue 0), configure these options: **pmd-cpu-mask,
+pmd-rxq-affinity, and n_rxq**. The **xdpmode** can be "drv" or "skb"::
+
+  ethtool -L enp2s0 combined 1
+  ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x10
+  ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="afxdp" \
+    options:n_rxq=1 options:xdpmode=drv \
+    other_config:pmd-rxq-affinity="0:4"
+
+Or, use 4 pmds/cores and 4 queues by doing::
+
+  ethtool -L enp2s0 combined 4
+  ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x36
+  ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="afxdp" \
+    options:n_rxq=4 options:xdpmode=drv \
+    other_config:pmd-rxq-affinity="0:1,1:2,2:4,3:5"
+
+.. note::
+   pmd-rxq-affinity is optional. If not specified, the system will auto-assign.
+
+To validate that the bridge has been successfully instantiated, run::
+
+  ovs-vsctl show
+
+The output should show something like::
+
+  Port "ens802f0"
+   Interface "ens802f0"
+      type: afxdp
+      options: {n_rxq="1", xdpmode=drv}
+
+Otherwise, enable debugging by::
+
+  ovs-appctl vlog/set netdev_afxdp::dbg
+
+
+References
+----------
+Most of the design details are described in the paper presented at the
+Linux Plumbers Conference 2018, "Bringing the Power of eBPF to Open
+vSwitch"[1], section 4, and slides[2][4].
+"The Path to DPDK Speeds for AF XDP"[3] gives a very good introduction
+to current and future work on AF_XDP.
+
+[1] http://vger.kernel.org/lpc_net2018_talks/ovs-ebpf-afxdp.pdf
+
+[2] http://vger.kernel.org/lpc_net2018_talks/ovs-ebpf-lpc18-presentation.pdf
+
+[3] http://vger.kernel.org/lpc_net2018_talks/lpc18_paper_af_xdp_perf-v2.pdf
+
+[4] https://ovsfall2018.sched.com/event/IO7p/fast-userspace-ovs-with-afxdp
+
+
+Performance Tuning
+------------------
+The name of the game is to keep your CPU running in userspace, allowing the
+PMD to keep polling the AF_XDP queues without any interference from the kernel.
+
+#. Make sure everything is on the same NUMA node (memory used by AF_XDP, cores
+   running PMD threads, device plug-in slot).
+
+#. Isolate your CPUs with the isolcpus kernel parameter in the GRUB config.
+
+#. Do not steer IRQs to the cores running PMD threads.
+
+#. The Spectre and Meltdown fixes increase the overhead of system calls.
+
+
+Debugging performance issues
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+While running the traffic, use the Linux perf tool to see where your CPU
+spends its cycles::
+
+  cd bpf-next/tools/perf
+  make
+  ./perf record -p `pidof ovs-vswitchd` sleep 10
+  ./perf report
+
+Measure your system call rate by doing::
+
+  pstree -p `pidof ovs-vswitchd`
+  strace -c -p <your pmd's PID>
+
+Or, use OVS pmd tool::
+
+  ovs-appctl dpif-netdev/pmd-stats-show
+
+
+Example Script
+--------------
+
+Below is a script using namespaces and veth pairs::
+
+  #!/bin/bash
+  ovs-vswitchd --no-chdir --pidfile -vvconn -vofproto_dpif -vunixctl \
+    --disable-system --detach
+  # Create the bridge once, using the userspace (netdev) datapath.
+  ovs-vsctl -- add-br br0 -- set Bridge br0 \
+    protocols=OpenFlow10,OpenFlow11,OpenFlow12,OpenFlow13,OpenFlow14 \
+    fail-mode=secure datapath_type=netdev
+
+  ip netns add at_ns0
+  ovs-appctl vlog/set netdev_afxdp::dbg
+
+  ip link add p0 type veth peer name afxdp-p0
+  ip link set p0 netns at_ns0
+  ip link set dev afxdp-p0 up
+  ovs-vsctl add-port br0 afxdp-p0 -- \
+    set interface afxdp-p0 external-ids:iface-id="p0" type="afxdp"
+
+  ip netns exec at_ns0 sh << NS_EXEC_HEREDOC
+  ip addr add "10.1.1.1/24" dev p0
+  ip link set dev p0 up
+  NS_EXEC_HEREDOC
+
+  ip netns add at_ns1
+  ip link add p1 type veth peer name afxdp-p1
+  ip link set p1 netns at_ns1
+  ip link set dev afxdp-p1 up
+
+  ovs-vsctl add-port br0 afxdp-p1 -- \
+    set interface afxdp-p1 external-ids:iface-id="p1" type="afxdp"
+  ip netns exec at_ns1 sh << NS_EXEC_HEREDOC
+  ip addr add "10.1.1.2/24" dev p1
+  ip link set dev p1 up
+  NS_EXEC_HEREDOC
+
+  ip netns exec at_ns0 ping -i .2 10.1.1.2
+
+
+Limitations/Known Issues
+------------------------
+#. Device's NUMA ID is always 0; need a way to find the NUMA ID from a netdev.
+#. No QoS support because the AF_XDP netdev bypasses the Linux TC layer.  A
+   possible workaround is to use the OpenFlow meter action.
+#. Adding an AF_XDP device to a bridge, removing it, and adding it again fails.
+#. Most of the tests are done using a single i40e port.  Multiple ports and
+   the ixgbe driver also need to be tested.
+#. No latency test results yet (TODO item).
+
+
+PVP using tap device
+--------------------
+Assume you have enp2s0 as the physical NIC, and a tap device connected to a VM.
+First, start OVS, then add the physical port::
+
+  ethtool -L enp2s0 combined 1
+  ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x10
+  ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="afxdp" \
+    options:n_rxq=1 options:xdpmode=drv \
+    other_config:pmd-rxq-affinity="0:4"
+
+Start a VM with virtio and tap device::
+
+  qemu-system-x86_64 -hda ubuntu1810.qcow \
+    -m 4096 \
+    -cpu host,+x2apic -enable-kvm \
+    -device virtio-net-pci,mac=00:02:00:00:00:01,netdev=net0,mq=on,\
+      vectors=10,mrg_rxbuf=on,rx_queue_size=1024 \
+    -netdev type=tap,id=net0,vhost=on,queues=8 \
+    -object memory-backend-file,id=mem,size=4096M,\
+      mem-path=/dev/hugepages,share=on \
+    -numa node,memdev=mem -mem-prealloc -smp 2
+
+Add the tap port and create OpenFlow rules::
+
+  ovs-vsctl add-port br0 tap0
+  ovs-ofctl del-flows br0
+  ovs-ofctl add-flow br0 "in_port=enp2s0, actions=output:tap0"
+  ovs-ofctl add-flow br0 "in_port=tap0, actions=output:enp2s0"
+
+Inside the VM, use xdp_rxq_info to bounce back the traffic::
+
+  ./xdp_rxq_info --dev ens3 --action XDP_TX
+
+
+PVP using vhostuser device
+--------------------------
+First, build OVS with DPDK and AF_XDP::
+
+  ./configure  --enable-afxdp --with-dpdk=<dpdk path>
+  make -j4 && make install
+
+Create a vhost-user port from OVS::
+
+  ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-init=true
+  ovs-vsctl -- add-br br0 -- set Bridge br0 datapath_type=netdev \
+    other_config:pmd-cpu-mask=0xfff
+  ovs-vsctl add-port br0 vhost-user-1 \
+    -- set Interface vhost-user-1 type=dpdkvhostuser
+
+Start VM using vhost-user mode::
+
+  qemu-system-x86_64 -hda ubuntu1810.qcow \
+   -m 4096 \
+   -cpu host,+x2apic -enable-kvm \
+   -chardev socket,id=char1,path=/usr/local/var/run/openvswitch/vhost-user-1 \
+   -netdev type=vhost-user,id=mynet1,chardev=char1,vhostforce,queues=4 \
+   -device virtio-net-pci,mac=00:00:00:00:00:01,\
+      netdev=mynet1,mq=on,vectors=10 \
+   -object memory-backend-file,id=mem,size=4096M,\
+      mem-path=/dev/hugepages,share=on \
+   -numa node,memdev=mem -mem-prealloc -smp 2
+
+Set up the OpenFlow rules::
+
+  ovs-ofctl del-flows br0
+  ovs-ofctl add-flow br0 "in_port=enp2s0, actions=output:vhost-user-1"
+  ovs-ofctl add-flow br0 "in_port=vhost-user-1, actions=output:enp2s0"
+
+Inside the VM, use xdp_rxq_info to drop or bounce back the traffic::
+
+  ./xdp_rxq_info --dev ens3 --action XDP_DROP
+  ./xdp_rxq_info --dev ens3 --action XDP_TX
+
+
+PCP container using veth
+------------------------
+Create namespace and veth peer devices::
+
+  ip netns add at_ns0
+  ip link add p0 type veth peer name afxdp-p0
+  ip link set p0 netns at_ns0
+  ip link set dev afxdp-p0 up
+  ip netns exec at_ns0 ip link set dev p0 up
+
+Attach the veth port to br0 (linux kernel mode)::
+
+  ovs-vsctl add-port br0 afxdp-p0 -- \
+    set interface afxdp-p0 options:n_rxq=1
+
+Or, use AF_XDP with skb mode::
+
+  ovs-vsctl add-port br0 afxdp-p0 -- \
+    set interface afxdp-p0 type="afxdp" options:n_rxq=1 options:xdpmode=skb
+
+Set up the OpenFlow rules::
+
+  ovs-ofctl del-flows br0
+  ovs-ofctl add-flow br0 "in_port=enp2s0, actions=output:afxdp-p0"
+  ovs-ofctl add-flow br0 "in_port=afxdp-p0, actions=output:enp2s0"
+
+In the namespace, use xdp_rxq_info to drop or bounce back the packets::
+
+  ip netns exec at_ns0 ./xdp_rxq_info --dev p0 --action XDP_DROP
+  ip netns exec at_ns0 ./xdp_rxq_info --dev p0 --action XDP_TX
+
+
+Bug Reporting
+-------------
+
+Please report problems to dev@openvswitch.org.
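
The pmd-cpu-mask values used in afxdp.rst above are bitmaps with one bit per core ID: 0x10 selects core 4, and 0x36 selects cores 1, 2, 4 and 5. A small helper like the following can build or check such a mask before it is passed to ovs-vsctl. Both functions are hypothetical illustrations and are not part of OVS.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Builds a CPU mask with one bit set per listed core ID (0..63). */
static uint64_t
toy_cpu_mask(const unsigned int *cores, size_t n)
{
    uint64_t mask = 0;

    for (size_t i = 0; i < n; i++) {
        assert(cores[i] < 64);
        mask |= UINT64_C(1) << cores[i];
    }
    return mask;
}

/* Returns true if 'core' is selected by 'mask'. */
static bool
toy_core_in_mask(uint64_t mask, unsigned int core)
{
    return core < 64 && ((mask >> core) & 1);
}
```

A core named in pmd-rxq-affinity is expected to be covered by pmd-cpu-mask, so a check such as `toy_core_in_mask(0x36, 3)`, which is false, is a quick way to spot a pinning that refers to a core without a PMD thread.
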
diff --git a/Documentation/intro/install/index.rst b/Documentation/intro/install/index.rst
index 3193c736cf17..c27a9c9d16ff 100644
--- a/Documentation/intro/install/index.rst
+++ b/Documentation/intro/install/index.rst
@@ -45,6 +45,7 @@  Installation from Source
    xenserver
    userspace
    dpdk
+   afxdp
 
 Installation from Packages
 --------------------------
diff --git a/acinclude.m4 b/acinclude.m4
index 321a741985db..bb03b504a2a8 100644
--- a/acinclude.m4
+++ b/acinclude.m4
@@ -238,6 +238,41 @@  AC_DEFUN([OVS_FIND_DEPENDENCY], [
   ])
 ])
 
+dnl OVS_CHECK_LINUX_AF_XDP
+dnl
+dnl Check both Linux kernel AF_XDP and libbpf support
+AC_DEFUN([OVS_CHECK_LINUX_AF_XDP], [
+  AC_ARG_ENABLE([afxdp],
+                [AC_HELP_STRING([--enable-afxdp], [Enable AF_XDP support])],
+                [], [enable_afxdp=no])
+  AC_MSG_CHECKING([whether AF_XDP is enabled])
+  if test "$enable_afxdp" != yes; then
+    AC_MSG_RESULT([no])
+    AF_XDP_ENABLE=false
+  else
+    AC_MSG_RESULT([yes])
+    AF_XDP_ENABLE=true
+
+    AC_CHECK_HEADER([bpf/libbpf.h], [],
+      [AC_MSG_ERROR([unable to find bpf/libbpf.h for AF_XDP support])])
+
+    AC_CHECK_HEADER([linux/if_xdp.h], [],
+      [AC_MSG_ERROR([unable to find linux/if_xdp.h for AF_XDP support])])
+
+    AC_CHECK_HEADER([bpf/xsk.h], [],
+      [AC_MSG_ERROR([unable to find bpf/xsk.h for AF_XDP support])])
+
+    AC_CHECK_HEADER([bpf/libbpf_util.h], [],
+      [AC_MSG_ERROR([unable to find bpf/libbpf_util.h for AF_XDP support])])
+
+    AC_DEFINE([HAVE_AF_XDP], [1],
+              [Define to 1 if AF_XDP support is available and enabled.])
+    LIBBPF_LDADD=" -lbpf -lelf"
+    AC_SUBST([LIBBPF_LDADD])
+  fi
+  AM_CONDITIONAL([HAVE_AF_XDP], test "$AF_XDP_ENABLE" = true)
+])
+
 dnl OVS_CHECK_DPDK
 dnl
 dnl Configure DPDK source tree
diff --git a/configure.ac b/configure.ac
index a9f0a06dc140..36ad246203db 100644
--- a/configure.ac
+++ b/configure.ac
@@ -98,6 +98,7 @@  OVS_CHECK_SPHINX
 OVS_CHECK_DOT
 OVS_CHECK_IF_DL
 OVS_CHECK_STRTOK_R
+OVS_CHECK_LINUX_AF_XDP
 AC_CHECK_DECLS([sys_siglist], [], [], [[#include <signal.h>]])
 AC_CHECK_MEMBERS([struct stat.st_mtim.tv_nsec, struct stat.st_mtimensec],
   [], [], [[#include <sys/stat.h>]])
diff --git a/lib/automake.mk b/lib/automake.mk
index 1b89cac8c3a2..9b75e47ba396 100644
--- a/lib/automake.mk
+++ b/lib/automake.mk
@@ -14,6 +14,10 @@  if WIN32
 lib_libopenvswitch_la_LIBADD += ${PTHREAD_LIBS}
 endif
 
+if HAVE_AF_XDP
+lib_libopenvswitch_la_LIBADD += $(LIBBPF_LDADD)
+endif
+
 lib_libopenvswitch_la_LDFLAGS = \
         $(OVS_LTINFO) \
         -Wl,--version-script=$(top_builddir)/lib/libopenvswitch.sym \
@@ -394,6 +398,7 @@  lib_libopenvswitch_la_SOURCES += \
 	lib/if-notifier.h \
 	lib/netdev-linux.c \
 	lib/netdev-linux.h \
+	lib/netdev-linux-private.h \
 	lib/netdev-offload-tc.c \
 	lib/netlink-conntrack.c \
 	lib/netlink-conntrack.h \
@@ -410,6 +415,15 @@  lib_libopenvswitch_la_SOURCES += \
 	lib/tc.h
 endif
 
+if HAVE_AF_XDP
+lib_libopenvswitch_la_SOURCES += \
+	lib/xdpsock.c \
+	lib/xdpsock.h \
+	lib/netdev-afxdp.c \
+	lib/netdev-afxdp.h \
+	lib/spinlock.h
+endif
+
 if DPDK_NETDEV
 lib_libopenvswitch_la_SOURCES += \
 	lib/dpdk.c \
diff --git a/lib/dp-packet.c b/lib/dp-packet.c
index 0976a35e758b..e6a7947076b4 100644
--- a/lib/dp-packet.c
+++ b/lib/dp-packet.c
@@ -19,6 +19,7 @@ 
 #include <string.h>
 
 #include "dp-packet.h"
+#include "netdev-afxdp.h"
 #include "netdev-dpdk.h"
 #include "openvswitch/dynamic-string.h"
 #include "util.h"
@@ -59,6 +60,27 @@  dp_packet_use(struct dp_packet *b, void *base, size_t allocated)
     dp_packet_use__(b, base, allocated, DPBUF_MALLOC);
 }
 
+#if HAVE_AF_XDP
+/* Initializes 'b' as an empty dp_packet that contains the 'allocated' bytes
+ * of memory starting at 'base', which points into an AF_XDP umem.
+ */
+void
+dp_packet_use_afxdp(struct dp_packet *b, void *base, size_t allocated)
+{
+    dp_packet_set_base(b, base);
+    dp_packet_set_data(b, base);
+    dp_packet_set_size(b, 0);
+
+    dp_packet_set_allocated(b, allocated);
+    b->source = DPBUF_AFXDP;
+    dp_packet_reset_offsets(b);
+    pkt_metadata_init(&b->md, 0);
+    dp_packet_reset_cutlen(b);
+    dp_packet_reset_offload(b);
+    b->packet_type = htonl(PT_ETH);
+}
+#endif
+
 /* Initializes 'b' as an empty dp_packet that contains the 'allocated' bytes of
  * memory starting at 'base'.  'base' should point to a buffer on the stack.
  * (Nothing actually relies on 'base' being allocated on the stack.  It could
@@ -122,6 +144,8 @@  dp_packet_uninit(struct dp_packet *b)
              * created as a dp_packet */
             free_dpdk_buf((struct dp_packet*) b);
 #endif
+        } else if (b->source == DPBUF_AFXDP) {
+            free_afxdp_buf(b);
         }
     }
 }
@@ -248,6 +272,9 @@  dp_packet_resize__(struct dp_packet *b, size_t new_headroom, size_t new_tailroom
     case DPBUF_STACK:
         OVS_NOT_REACHED();
 
+    case DPBUF_AFXDP:
+        OVS_NOT_REACHED();
+
     case DPBUF_STUB:
         b->source = DPBUF_MALLOC;
         new_base = xmalloc(new_allocated);
@@ -433,6 +460,7 @@  dp_packet_steal_data(struct dp_packet *b)
 {
     void *p;
     ovs_assert(b->source != DPBUF_DPDK);
+    ovs_assert(b->source != DPBUF_AFXDP);
 
     if (b->source == DPBUF_MALLOC && dp_packet_data(b) == dp_packet_base(b)) {
         p = dp_packet_data(b);
diff --git a/lib/dp-packet.h b/lib/dp-packet.h
index a5e9ade1244a..e3438226e360 100644
--- a/lib/dp-packet.h
+++ b/lib/dp-packet.h
@@ -25,6 +25,7 @@ 
 #include <rte_mbuf.h>
 #endif
 
+#include "netdev-afxdp.h"
 #include "netdev-dpdk.h"
 #include "openvswitch/list.h"
 #include "packets.h"
@@ -42,6 +43,7 @@  enum OVS_PACKED_ENUM dp_packet_source {
     DPBUF_DPDK,                /* buffer data is from DPDK allocated memory.
                                 * ref to dp_packet_init_dpdk() in dp-packet.c.
                                 */
+    DPBUF_AFXDP,               /* Buffer data is from an AF_XDP umem frame. */
 };
 
 #define DP_PACKET_CONTEXT_SIZE 64
@@ -89,6 +91,13 @@  struct dp_packet {
     };
 };
 
+#ifdef HAVE_AF_XDP
+struct dp_packet_afxdp {
+    struct umem_pool *mpool;
+    struct dp_packet packet;
+};
+#endif
+
 static inline void *dp_packet_data(const struct dp_packet *);
 static inline void dp_packet_set_data(struct dp_packet *, void *);
 static inline void *dp_packet_base(const struct dp_packet *);
@@ -122,7 +131,9 @@  static inline const void *dp_packet_get_nd_payload(const struct dp_packet *);
 void dp_packet_use(struct dp_packet *, void *, size_t);
 void dp_packet_use_stub(struct dp_packet *, void *, size_t);
 void dp_packet_use_const(struct dp_packet *, const void *, size_t);
-
+#ifdef HAVE_AF_XDP
+void dp_packet_use_afxdp(struct dp_packet *, void *, size_t);
+#endif
 void dp_packet_init_dpdk(struct dp_packet *);
 
 void dp_packet_init(struct dp_packet *, size_t);
@@ -184,6 +195,11 @@  dp_packet_delete(struct dp_packet *b)
             return;
         }
 
+        if (b->source == DPBUF_AFXDP) {
+            free_afxdp_buf(b);
+            return;
+        }
+
         dp_packet_uninit(b);
         free(b);
     }
diff --git a/lib/dpif-netdev-perf.h b/lib/dpif-netdev-perf.h
index 859c05613ddf..6b6dfda7db1c 100644
--- a/lib/dpif-netdev-perf.h
+++ b/lib/dpif-netdev-perf.h
@@ -21,6 +21,7 @@ 
 #include <stddef.h>
 #include <stdint.h>
 #include <string.h>
+#include <time.h>
 #include <math.h>
 
 #ifdef DPDK_NETDEV
@@ -186,6 +187,24 @@  struct pmd_perf_stats {
     char *log_reason;
 };
 
+#ifdef __linux__
+static inline uint64_t
+rdtsc_syscall(struct pmd_perf_stats *s)
+{
+    struct timespec val;
+    uint64_t v;
+
+    if (clock_gettime(CLOCK_MONOTONIC_RAW, &val) != 0) {
+        return s->last_tsc;
+    }
+
+    v  = (uint64_t) val.tv_sec * 1000000000LL;
+    v += (uint64_t) val.tv_nsec;
+
+    return s->last_tsc = v;
+}
+#endif
+
 /* Support for accurate timing of PMD execution on TSC clock cycle level.
  * These functions are intended to be invoked in the context of pmd threads. */
 
@@ -198,6 +217,13 @@  cycles_counter_update(struct pmd_perf_stats *s)
 {
 #ifdef DPDK_NETDEV
     return s->last_tsc = rte_get_tsc_cycles();
+#elif !defined(_MSC_VER) && defined(__x86_64__)
+    uint32_t h, l;
+    asm volatile("rdtsc" : "=a" (l), "=d" (h));
+
+    return s->last_tsc = ((uint64_t) h << 32) | l;
+#elif defined(__linux__)
+    return rdtsc_syscall(s);
 #else
     return s->last_tsc = 0;
 #endif
diff --git a/lib/netdev-afxdp.c b/lib/netdev-afxdp.c
new file mode 100644
index 000000000000..33d8612153d5
--- /dev/null
+++ b/lib/netdev-afxdp.c
@@ -0,0 +1,891 @@ 
+/*
+ * Copyright (c) 2018, 2019 Nicira, Inc.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include <config.h>
+
+#include "netdev-linux-private.h"
+#include "netdev-linux.h"
+#include "netdev-afxdp.h"
+
+#include <errno.h>
+#include <inttypes.h>
+#include <linux/rtnetlink.h>
+#include <linux/if_xdp.h>
+#include <net/if.h>
+#include <stdlib.h>
+#include <sys/resource.h>
+#include <sys/socket.h>
+#include <sys/types.h>
+#include <unistd.h>
+
+#include "coverage.h"
+#include "dp-packet.h"
+#include "dpif-netdev.h"
+#include "openvswitch/dynamic-string.h"
+#include "openvswitch/vlog.h"
+#include "packets.h"
+#include "socket-util.h"
+#include "spinlock.h"
+#include "util.h"
+#include "xdpsock.h"
+
+#ifndef SOL_XDP
+#define SOL_XDP 283
+#endif
+
+COVERAGE_DEFINE(afxdp_cq_empty);
+COVERAGE_DEFINE(afxdp_fq_full);
+
+VLOG_DEFINE_THIS_MODULE(netdev_afxdp);
+static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
+
+#define UMEM2DESC(elem, base) ((uint64_t)((char *)(elem) - (char *)(base)))
+#define UMEM2XPKT(base, i) \
+                  ALIGNED_CAST(struct dp_packet_afxdp *, (char *)(base) + \
+                               (i) * sizeof(struct dp_packet_afxdp))
+
+static struct xsk_socket_info *xsk_configure(int ifindex, int xdp_queue_id,
+                                             int mode);
+static void xsk_remove_xdp_program(uint32_t ifindex, int xdpmode);
+static void xsk_destroy(struct xsk_socket_info *xsk);
+static int xsk_configure_all(struct netdev *netdev);
+static void xsk_destroy_all(struct netdev *netdev);
+
+static struct xsk_umem_info *
+xsk_configure_umem(void *buffer, uint64_t size, int xdpmode)
+{
+    struct xsk_umem_config uconfig OVS_UNUSED;
+    struct xsk_umem_info *umem;
+    int ret;
+    int i;
+
+    umem = xcalloc(1, sizeof *umem);
+    ret = xsk_umem__create(&umem->umem, buffer, size, &umem->fq, &umem->cq,
+                           NULL);
+    if (ret) {
+        VLOG_ERR("xsk_umem__create failed (%s) mode: %s",
+                 ovs_strerror(errno),
+                 xdpmode == XDP_COPY ? "SKB" : "DRV");
+        free(umem);
+        return NULL;
+    }
+
+    umem->buffer = buffer;
+
+    /* set-up umem pool */
+    if (umem_pool_init(&umem->mpool, NUM_FRAMES) < 0) {
+        VLOG_ERR("umem_pool_init failed");
+        if (xsk_umem__delete(umem->umem)) {
+            VLOG_ERR("xsk_umem__delete failed");
+        }
+        free(umem);
+        return NULL;
+    }
+
+    for (i = NUM_FRAMES - 1; i >= 0; i--) {
+        struct umem_elem *elem;
+
+        elem = ALIGNED_CAST(struct umem_elem *,
+                            (char *)umem->buffer + i * FRAME_SIZE);
+        umem_elem_push(&umem->mpool, elem);
+    }
+
+    /* set-up metadata */
+    if (xpacket_pool_init(&umem->xpool, NUM_FRAMES) < 0) {
+        VLOG_ERR("xpacket_pool_init failed");
+        umem_pool_cleanup(&umem->mpool);
+        if (xsk_umem__delete(umem->umem)) {
+            VLOG_ERR("xsk_umem__delete failed");
+        }
+        free(umem);
+        return NULL;
+    }
+
+    VLOG_DBG("%s xpacket pool from %p to %p", __func__,
+              umem->xpool.array,
+              (char *)umem->xpool.array +
+              NUM_FRAMES * sizeof(struct dp_packet_afxdp));
+
+    for (i = NUM_FRAMES - 1; i >= 0; i--) {
+        struct dp_packet_afxdp *xpacket;
+        struct dp_packet *packet;
+
+        xpacket = UMEM2XPKT(umem->xpool.array, i);
+        xpacket->mpool = &umem->mpool;
+
+        packet = &xpacket->packet;
+        packet->source = DPBUF_AFXDP;
+    }
+
+    return umem;
+}
+
+static struct xsk_socket_info *
+xsk_configure_socket(struct xsk_umem_info *umem, uint32_t ifindex,
+                     uint32_t queue_id, int xdpmode)
+{
+    struct xsk_socket_config cfg;
+    struct xsk_socket_info *xsk;
+    char devname[IF_NAMESIZE];
+    uint32_t idx = 0, prog_id;
+    int ret;
+    int i;
+
+    xsk = xcalloc(1, sizeof(*xsk));
+    xsk->umem = umem;
+    cfg.rx_size = CONS_NUM_DESCS;
+    cfg.tx_size = PROD_NUM_DESCS;
+    cfg.libbpf_flags = 0;
+
+    if (xdpmode == XDP_ZEROCOPY) {
+        cfg.bind_flags = XDP_ZEROCOPY;
+        cfg.xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_DRV_MODE;
+    } else {
+        cfg.bind_flags = XDP_COPY;
+        cfg.xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_SKB_MODE;
+    }
+
+    if (if_indextoname(ifindex, devname) == NULL) {
+        VLOG_ERR("ifindex %d to devname failed (%s)",
+                 ifindex, ovs_strerror(errno));
+        free(xsk);
+        return NULL;
+    }
+
+    ret = xsk_socket__create(&xsk->xsk, devname, queue_id, umem->umem,
+                             &xsk->rx, &xsk->tx, &cfg);
+    if (ret) {
+        VLOG_ERR("xsk_socket__create failed (%s) mode: %s qid: %d",
+                 ovs_strerror(errno),
+                 xdpmode == XDP_COPY ? "SKB" : "DRV",
+                 queue_id);
+        free(xsk);
+        return NULL;
+    }
+
+    /* Make sure the built-in AF_XDP program is loaded */
+    ret = bpf_get_link_xdp_id(ifindex, &prog_id, cfg.xdp_flags);
+    if (ret) {
+        VLOG_ERR("Get XDP prog ID failed (%s)", ovs_strerror(errno));
+        xsk_socket__delete(xsk->xsk);
+        free(xsk);
+        return NULL;
+    }
+
+    while (!xsk_ring_prod__reserve(&xsk->umem->fq,
+                                   PROD_NUM_DESCS, &idx)) {
+        VLOG_WARN_RL(&rl, "Retry xsk_ring_prod__reserve to FILL queue");
+    }
+
+    for (i = 0;
+         i < PROD_NUM_DESCS * FRAME_SIZE;
+         i += FRAME_SIZE) {
+        struct umem_elem *elem;
+        uint64_t addr;
+
+        elem = umem_elem_pop(&xsk->umem->mpool);
+        addr = UMEM2DESC(elem, xsk->umem->buffer);
+
+        *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx++) = addr;
+    }
+
+    xsk_ring_prod__submit(&xsk->umem->fq,
+                          PROD_NUM_DESCS);
+    return xsk;
+}
+
+static struct xsk_socket_info *
+xsk_configure(int ifindex, int xdp_queue_id, int xdpmode)
+{
+    struct xsk_socket_info *xsk;
+    struct xsk_umem_info *umem;
+    void *bufs;
+
+    /* umem memory region */
+    bufs = xmalloc_pagealign(NUM_FRAMES * FRAME_SIZE);
+    memset(bufs, 0, NUM_FRAMES * FRAME_SIZE);
+
+    /* create AF_XDP socket */
+    umem = xsk_configure_umem(bufs,
+                              NUM_FRAMES * FRAME_SIZE,
+                              xdpmode);
+    if (!umem) {
+        free_pagealign(bufs);
+        return NULL;
+    }
+
+    xsk = xsk_configure_socket(umem, ifindex, xdp_queue_id, xdpmode);
+    if (!xsk) {
+        /* clean up umem and xpacket pool */
+        if (xsk_umem__delete(umem->umem)) {
+            VLOG_ERR("xsk_umem__delete failed");
+        }
+        free_pagealign(bufs);
+        umem_pool_cleanup(&umem->mpool);
+        xpacket_pool_cleanup(&umem->xpool);
+        free(umem);
+    }
+    return xsk;
+}
+
+static int
+xsk_configure_all(struct netdev *netdev)
+{
+    struct netdev_linux *dev = netdev_linux_cast(netdev);
+    struct xsk_socket_info *xsk_info;
+    int i, ifindex, n_rxq;
+
+    ifindex = linux_get_ifindex(netdev_get_name(netdev));
+
+    n_rxq = netdev_n_rxq(netdev);
+    dev->xsks = xzalloc(n_rxq * sizeof(struct xsk_socket_info *));
+
+    /* configure each queue */
+    for (i = 0; i < n_rxq; i++) {
+        VLOG_INFO("%s configure queue %d mode %s", __func__, i,
+                  dev->xdpmode == XDP_COPY ? "SKB" : "DRV");
+        xsk_info = xsk_configure(ifindex, i, dev->xdpmode);
+        if (!xsk_info) {
+            VLOG_ERR("failed to create AF_XDP socket on queue %d", i);
+            dev->xsks[i] = NULL;
+            goto err;
+        }
+        dev->xsks[i] = xsk_info;
+        xsk_info->rx_dropped = 0;
+        xsk_info->tx_dropped = 0;
+    }
+
+    return 0;
+
+err:
+    xsk_destroy_all(netdev);
+    return EINVAL;
+}
+
+static void
+xsk_destroy(struct xsk_socket_info *xsk_info)
+{
+    struct xsk_umem *umem;
+
+    xsk_socket__delete(xsk_info->xsk);
+    xsk_info->xsk = NULL;
+
+    umem = xsk_info->umem->umem;
+    if (xsk_umem__delete(umem)) {
+        VLOG_ERR("xsk_umem__delete failed");
+    }
+
+    /* free the packet buffer */
+    free_pagealign(xsk_info->umem->buffer);
+
+    /* cleanup umem pool */
+    umem_pool_cleanup(&xsk_info->umem->mpool);
+
+    /* cleanup metadata pool */
+    xpacket_pool_cleanup(&xsk_info->umem->xpool);
+
+    free(xsk_info->umem);
+    free(xsk_info);
+}
+
+static void
+xsk_destroy_all(struct netdev *netdev)
+{
+    struct netdev_linux *dev = netdev_linux_cast(netdev);
+    int i, ifindex;
+
+    ifindex = linux_get_ifindex(netdev_get_name(netdev));
+
+    for (i = 0; i < netdev_n_rxq(netdev); i++) {
+        if (dev->xsks && dev->xsks[i]) {
+            VLOG_INFO("destroy xsk[%d]", i);
+            xsk_destroy(dev->xsks[i]);
+            dev->xsks[i] = NULL;
+        }
+    }
+
+    VLOG_INFO("remove xdp program");
+    xsk_remove_xdp_program(ifindex, dev->xdpmode);
+    free(dev->xsks);
+    dev->xsks = NULL;
+}
+
+static inline void OVS_UNUSED
+log_xsk_stat(struct xsk_socket_info *xsk OVS_UNUSED) {
+    struct xdp_statistics stat;
+    socklen_t optlen;
+
+    optlen = sizeof stat;
+    ovs_assert(getsockopt(xsk_socket__fd(xsk->xsk), SOL_XDP, XDP_STATISTICS,
+               &stat, &optlen) == 0);
+
+    VLOG_DBG_RL(&rl, "rx dropped %llu, rx_invalid %llu, tx_invalid %llu",
+                stat.rx_dropped,
+                stat.rx_invalid_descs,
+                stat.tx_invalid_descs);
+}
+
+int
+netdev_afxdp_set_config(struct netdev *netdev, const struct smap *args,
+                        char **errp OVS_UNUSED)
+{
+    struct netdev_linux *dev = netdev_linux_cast(netdev);
+    const char *str_xdpmode;
+    int xdpmode, new_n_rxq;
+
+    ovs_mutex_lock(&dev->mutex);
+    new_n_rxq = MAX(smap_get_int(args, "n_rxq", NR_QUEUE), 1);
+    if (new_n_rxq > MAX_XSKQ) {
+        ovs_mutex_unlock(&dev->mutex);
+        VLOG_ERR("%s: Too big 'n_rxq' (%d > %d).",
+                 netdev_get_name(netdev), new_n_rxq, MAX_XSKQ);
+        return EINVAL;
+    }
+
+    str_xdpmode = smap_get_def(args, "xdpmode", "skb");
+    if (!strcasecmp(str_xdpmode, "drv")) {
+        xdpmode = XDP_ZEROCOPY;
+    } else if (!strcasecmp(str_xdpmode, "skb")) {
+        xdpmode = XDP_COPY;
+    } else {
+        VLOG_ERR("%s: Incorrect xdpmode (%s).",
+                 netdev_get_name(netdev), str_xdpmode);
+        ovs_mutex_unlock(&dev->mutex);
+        return EINVAL;
+    }
+
+    if (dev->requested_n_rxq != new_n_rxq
+        || dev->requested_xdpmode != xdpmode) {
+        dev->requested_n_rxq = new_n_rxq;
+        dev->requested_xdpmode = xdpmode;
+        netdev_request_reconfigure(netdev);
+    }
+    ovs_mutex_unlock(&dev->mutex);
+    return 0;
+}
+
+int
+netdev_afxdp_get_config(const struct netdev *netdev, struct smap *args)
+{
+    struct netdev_linux *dev = netdev_linux_cast(netdev);
+
+    ovs_mutex_lock(&dev->mutex);
+    smap_add_format(args, "n_rxq", "%d", netdev->n_rxq);
+    smap_add_format(args, "xdpmode", "%s",
+        dev->xdp_bind_flags == XDP_ZEROCOPY ? "drv" : "skb");
+    ovs_mutex_unlock(&dev->mutex);
+    return 0;
+}
+
+static void
+netdev_afxdp_alloc_txq(struct netdev *netdev)
+{
+    struct netdev_linux *dev = netdev_linux_cast(netdev);
+    int n_txqs = netdev_n_rxq(netdev);
+    int i;
+
+    dev->tx_locks = xmalloc(n_txqs * sizeof(struct ovs_spinlock));
+
+    for (i = 0; i < n_txqs; i++) {
+        ovs_spinlock_init(&dev->tx_locks[i]);
+    }
+}
+
+int
+netdev_afxdp_reconfigure(struct netdev *netdev)
+{
+    struct netdev_linux *dev = netdev_linux_cast(netdev);
+    struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
+    int err = 0;
+
+    ovs_mutex_lock(&dev->mutex);
+
+    if (netdev->n_rxq == dev->requested_n_rxq
+        && dev->xdpmode == dev->requested_xdpmode) {
+        goto out;
+    }
+
+    xsk_destroy_all(netdev);
+    free(dev->tx_locks);
+
+    netdev->n_rxq = dev->requested_n_rxq;
+    netdev_afxdp_alloc_txq(netdev);
+
+    if (dev->requested_xdpmode == XDP_ZEROCOPY) {
+        VLOG_INFO("AF_XDP device %s in DRV mode", netdev_get_name(netdev));
+        /* From SKB mode to DRV mode */
+        dev->xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_DRV_MODE;
+        dev->xdp_bind_flags = XDP_ZEROCOPY;
+        dev->xdpmode = XDP_ZEROCOPY;
+
+        if (setrlimit(RLIMIT_MEMLOCK, &r)) {
+            VLOG_ERR("ERROR: setrlimit(RLIMIT_MEMLOCK): %s",
+                      ovs_strerror(errno));
+        }
+    } else {
+        VLOG_INFO("AF_XDP device %s in SKB mode", netdev_get_name(netdev));
+        /* From DRV mode to SKB mode */
+        dev->xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_SKB_MODE;
+        dev->xdp_bind_flags = XDP_COPY;
+        dev->xdpmode = XDP_COPY;
+        /* TODO: set rlimit back to previous value
+         * when no device is in DRV mode.
+         */
+    }
+
+    err = xsk_configure_all(netdev);
+    if (err) {
+        VLOG_ERR("AF_XDP device %s reconfig fails", netdev_get_name(netdev));
+    }
+    netdev_change_seq_changed(netdev);
+out:
+    ovs_mutex_unlock(&dev->mutex);
+    return err;
+}
+
+int
+netdev_afxdp_get_numa_id(const struct netdev *netdev)
+{
+    /* FIXME: Get netdev's PCIe device ID, then find
+     * its NUMA node id.
+     */
+    VLOG_INFO("FIXME: Device %s always uses NUMA id 0",
+              netdev_get_name(netdev));
+    return 0;
+}
+
+static void
+xsk_remove_xdp_program(uint32_t ifindex, int xdpmode)
+{
+    uint32_t prog_id = 0;
+    uint32_t flags;
+
+    /* remove_xdp_program() */
+    if (xdpmode == XDP_COPY) {
+        flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_SKB_MODE;
+        VLOG_INFO("%s copy mode", __func__);
+    } else {
+        flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_DRV_MODE;
+        VLOG_INFO("%s drv mode", __func__);
+    }
+
+    if (bpf_get_link_xdp_id(ifindex, &prog_id, flags)) {
+        VLOG_WARN("failed to get XDP program id");
+    }
+    bpf_set_link_xdp_fd(ifindex, -1, flags);
+}
+
+void
+signal_remove_xdp(struct netdev *netdev)
+{
+    struct netdev_linux *dev = netdev_linux_cast(netdev);
+    int ifindex;
+
+    ifindex = linux_get_ifindex(netdev_get_name(netdev));
+
+    VLOG_WARN("force remove xdp program");
+    xsk_remove_xdp_program(ifindex, dev->xdpmode);
+}
+
+static struct dp_packet_afxdp *
+dp_packet_cast_afxdp(const struct dp_packet *d)
+{
+    ovs_assert(d->source == DPBUF_AFXDP);
+    return CONTAINER_OF(d, struct dp_packet_afxdp, packet);
+}
+
+static inline void
+prepare_fill_queue(struct xsk_socket_info *xsk_info)
+{
+    struct umem_elem *elems[BATCH_SIZE];
+    struct xsk_umem_info *umem;
+    unsigned int idx_fq;
+    int nb_free;
+    int i, ret;
+
+    umem = xsk_info->umem;
+
+    nb_free = PROD_NUM_DESCS / 2;
+    if (xsk_prod_nb_free(&umem->fq, nb_free) < nb_free) {
+        return;
+    }
+
+    ret = umem_elem_pop_n(&umem->mpool, BATCH_SIZE, (void **)elems);
+    if (OVS_UNLIKELY(ret)) {
+        return;
+    }
+
+    if (!xsk_ring_prod__reserve(&umem->fq, BATCH_SIZE, &idx_fq)) {
+        umem_elem_push_n(&umem->mpool, BATCH_SIZE, (void **)elems);
+        COVERAGE_INC(afxdp_fq_full);
+        return;
+    }
+
+    for (i = 0; i < BATCH_SIZE; i++) {
+        uint64_t index;
+        struct umem_elem *elem;
+
+        elem = elems[i];
+        index = (uint64_t)((char *)elem - (char *)umem->buffer);
+        ovs_assert((index & FRAME_SHIFT_MASK) == 0);
+        *xsk_ring_prod__fill_addr(&umem->fq, idx_fq) = index;
+
+        idx_fq++;
+    }
+    xsk_ring_prod__submit(&umem->fq, BATCH_SIZE);
+}
+
+int
+netdev_afxdp_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
+                      int *qfill)
+{
+    struct netdev_rxq_linux *rx = netdev_rxq_linux_cast(rxq_);
+    struct netdev *netdev = rx->up.netdev;
+    struct netdev_linux *dev = netdev_linux_cast(netdev);
+    struct xsk_socket_info *xsk_info;
+    struct xsk_umem_info *umem;
+    uint32_t idx_rx = 0;
+    int qid = rxq_->queue_id;
+    unsigned int rcvd, i;
+
+    xsk_info = dev->xsks ? dev->xsks[qid] : NULL;
+    if (!xsk_info || !xsk_info->xsk) {
+        return EAGAIN;
+    }
+
+    prepare_fill_queue(xsk_info);
+
+    umem = xsk_info->umem;
+    rx->fd = xsk_socket__fd(xsk_info->xsk);
+
+    rcvd = xsk_ring_cons__peek(&xsk_info->rx, BATCH_SIZE, &idx_rx);
+    if (!rcvd) {
+        return EAGAIN;
+    }
+
+    /* Setup a dp_packet batch from descriptors in RX queue */
+    for (i = 0; i < rcvd; i++) {
+        uint64_t addr = xsk_ring_cons__rx_desc(&xsk_info->rx, idx_rx)->addr;
+        uint32_t len = xsk_ring_cons__rx_desc(&xsk_info->rx, idx_rx)->len;
+        char *pkt = xsk_umem__get_data(umem->buffer, addr);
+        uint64_t index;
+
+        struct dp_packet_afxdp *xpacket;
+        struct dp_packet *packet;
+
+        index = addr >> FRAME_SHIFT;
+        xpacket = UMEM2XPKT(umem->xpool.array, index);
+        packet = &xpacket->packet;
+
+        /* Initialize the struct dp_packet */
+        dp_packet_use_afxdp(packet, pkt, FRAME_SIZE - FRAME_HEADROOM);
+        dp_packet_set_size(packet, len);
+
+        /* Add packet into batch, increase batch->count */
+        dp_packet_batch_add(batch, packet);
+
+        idx_rx++;
+    }
+    /* Release the RX queue */
+    xsk_ring_cons__release(&xsk_info->rx, rcvd);
+
+    if (qfill) {
+        /* TODO: return the number of remaining packets in the queue. */
+        *qfill = 0;
+    }
+
+#ifdef AFXDP_DEBUG
+    log_xsk_stat(xsk_info);
+#endif
+    return 0;
+}
+
+static inline int
+kick_tx(struct xsk_socket_info *xsk_info)
+{
+    int ret;
+
+    if (!xsk_info->outstanding_tx) {
+        return 0;
+    }
+
+    /* This triggers a system call into the kernel's xsk_sendmsg(), which
+     * calls xsk_generic_xmit() (skb mode) or xsk_async_xmit() (driver mode).
+     */
+    ret = sendto(xsk_socket__fd(xsk_info->xsk), NULL, 0, MSG_DONTWAIT,
+                                NULL, 0);
+    if (OVS_UNLIKELY(ret < 0)) {
+        if (errno == ENXIO || errno == ENOBUFS || errno == EOPNOTSUPP) {
+            return errno;
+        }
+    }
+    /* no error, or EBUSY or EAGAIN */
+    return 0;
+}
+
+void
+free_afxdp_buf(struct dp_packet *p)
+{
+    struct dp_packet_afxdp *xpacket;
+    uintptr_t addr;
+
+    xpacket = dp_packet_cast_afxdp(p);
+    if (xpacket->mpool) {
+        void *base = dp_packet_base(p);
+
+        addr = (uintptr_t)base & (~FRAME_SHIFT_MASK);
+        umem_elem_push(xpacket->mpool, (void *)addr);
+    }
+}
+
+static void
+free_afxdp_buf_batch(struct dp_packet_batch *batch)
+{
+    struct dp_packet_afxdp *xpacket = NULL;
+    struct dp_packet *packet;
+    void *elems[BATCH_SIZE];
+    uintptr_t addr;
+
+    DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
+        xpacket = dp_packet_cast_afxdp(packet);
+        if (xpacket->mpool) {
+            void *base = dp_packet_base(packet);
+
+            addr = (uintptr_t)base & (~FRAME_SHIFT_MASK);
+            elems[i] = (void *)addr;
+        }
+    }
+    umem_elem_push_n(xpacket->mpool, batch->count, elems);
+    dp_packet_batch_init(batch);
+}
+
+static inline bool
+check_free_batch(struct dp_packet_batch *batch)
+{
+    struct umem_pool *first_mpool = NULL;
+    struct dp_packet_afxdp *xpacket;
+    struct dp_packet *packet;
+
+    DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
+        if (packet->source != DPBUF_AFXDP) {
+            return false;
+        }
+        xpacket = dp_packet_cast_afxdp(packet);
+        if (i == 0) {
+            first_mpool = xpacket->mpool;
+            continue;
+        }
+        if (xpacket->mpool != first_mpool) {
+            return false;
+        }
+    }
+    /* All packets are DPBUF_AFXDP and from the same mpool */
+    return true;
+}
+
+static inline void
+afxdp_complete_tx(struct xsk_socket_info *xsk_info)
+{
+    struct umem_elem *elems_push[BATCH_SIZE];
+    struct xsk_umem_info *umem;
+    uint32_t idx_cq = 0;
+    int tx_to_free = 0;
+    int tx_done, j;
+
+    umem = xsk_info->umem;
+    tx_done = xsk_ring_cons__peek(&umem->cq, BATCH_SIZE, &idx_cq);
+
+    /* Recycle back to umem pool */
+    for (j = 0; j < tx_done; j++) {
+        struct umem_elem *elem;
+        uint64_t *addr;
+
+        addr = (uint64_t *)xsk_ring_cons__comp_addr(&umem->cq, idx_cq++);
+        if (*addr == UINT64_MAX) {
+            /* The elem has been pushed already. */
+            continue;
+        }
+        elem = ALIGNED_CAST(struct umem_elem *,
+                            (char *)umem->buffer + *addr);
+        elems_push[tx_to_free] = elem;
+        *addr = UINT64_MAX; /* Mark as pushed; 0 is a valid umem offset. */
+        tx_to_free++;
+    }
+
+    umem_elem_push_n(&umem->mpool, tx_to_free, (void **)elems_push);
+
+    if (tx_done > 0) {
+        xsk_ring_cons__release(&umem->cq, tx_done);
+        xsk_info->outstanding_tx -= tx_done;
+    } else {
+        COVERAGE_INC(afxdp_cq_empty);
+    }
+}
+
+int
+netdev_afxdp_batch_send(struct netdev *netdev, int qid,
+                        struct dp_packet_batch *batch,
+                        bool concurrent_txq)
+{
+    struct netdev_linux *dev = netdev_linux_cast(netdev);
+    struct xsk_socket_info *xsk_info = dev->xsks[qid];
+    struct umem_elem *elems_pop[BATCH_SIZE];
+    struct xsk_umem_info *umem;
+    struct dp_packet *packet;
+    bool free_batch = false;
+    uint32_t idx = 0;
+    int error = 0;
+    int ret;
+
+    if (OVS_UNLIKELY(concurrent_txq)) {
+        qid = qid % dev->up.n_txq;
+        ovs_spin_lock(&dev->tx_locks[qid]);
+    }
+    xsk_info = dev->xsks[qid]; /* 'qid' may have been remapped above. */
+    if (!xsk_info || !xsk_info->xsk) {
+        goto out;
+    }
+
+    afxdp_complete_tx(xsk_info);
+
+    free_batch = check_free_batch(batch);
+
+    umem = xsk_info->umem;
+    ret = umem_elem_pop_n(&umem->mpool, batch->count, (void **)elems_pop);
+    if (OVS_UNLIKELY(ret)) {
+        xsk_info->tx_dropped += batch->count;
+        error = ENOMEM;
+        goto out;
+    }
+
+    /* Make sure we have enough TX descs */
+    ret = xsk_ring_prod__reserve(&xsk_info->tx, batch->count, &idx);
+    if (OVS_UNLIKELY(ret == 0)) {
+        umem_elem_push_n(&umem->mpool, batch->count, (void **)elems_pop);
+        xsk_info->tx_dropped += batch->count;
+        error = ENOMEM;
+        goto out;
+    }
+
+    DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
+        struct umem_elem *elem;
+        uint64_t index;
+
+        elem = elems_pop[i];
+        /* Copy the packet into the umem element we just popped from the
+         * umem pool.  TODO: avoid this copy if the packet and the popped
+         * element are located in the same umem.
+         */
+        memcpy(elem, dp_packet_data(packet), dp_packet_size(packet));
+
+        index = (uint64_t)((char *)elem - (char *)umem->buffer);
+        xsk_ring_prod__tx_desc(&xsk_info->tx, idx + i)->addr = index;
+        xsk_ring_prod__tx_desc(&xsk_info->tx, idx + i)->len
+            = dp_packet_size(packet);
+    }
+    xsk_ring_prod__submit(&xsk_info->tx, batch->count);
+    xsk_info->outstanding_tx += batch->count;
+
+    ret = kick_tx(xsk_info);
+    if (OVS_UNLIKELY(ret)) {
+        VLOG_WARN_RL(&rl, "error sending AF_XDP packet: %s",
+                     ovs_strerror(ret));
+    }
+
+out:
+    if (free_batch) {
+        free_afxdp_buf_batch(batch);
+    } else {
+        dp_packet_delete_batch(batch, true);
+    }
+
+    if (OVS_UNLIKELY(concurrent_txq)) {
+        ovs_spin_unlock(&dev->tx_locks[qid]);
+    }
+    return error;
+}
+
+int
+netdev_afxdp_rxq_construct(struct netdev_rxq *rxq_ OVS_UNUSED)
+{
+    /* Done at reconfigure. */
+    return 0;
+}
+
+void
+netdev_afxdp_destruct(struct netdev *netdev_)
+{
+    struct netdev_linux *netdev = netdev_linux_cast(netdev_);
+
+    /* Note: tc is by-passed when using drv-mode, but when using
+     * skb-mode, we might need to clean up tc. */
+
+    xsk_destroy_all(netdev_);
+    ovs_mutex_destroy(&netdev->mutex);
+}
+
+int
+netdev_afxdp_get_stats(const struct netdev *netdev,
+                       struct netdev_stats *stats)
+{
+    struct netdev_linux *dev = netdev_linux_cast(netdev);
+    struct xsk_socket_info *xsk_info;
+    struct netdev_stats dev_stats;
+    int error, i;
+
+    ovs_mutex_lock(&dev->mutex);
+
+    error = get_stats_via_netlink(netdev, &dev_stats);
+    if (error) {
+        VLOG_WARN_RL(&rl, "Error getting AF_XDP statistics");
+    } else {
+        /* Use kernel netdev's packet and byte counts */
+        stats->rx_packets = dev_stats.rx_packets;
+        stats->rx_bytes = dev_stats.rx_bytes;
+        stats->tx_packets = dev_stats.tx_packets;
+        stats->tx_bytes = dev_stats.tx_bytes;
+
+        stats->rx_errors           += dev_stats.rx_errors;
+        stats->tx_errors           += dev_stats.tx_errors;
+        stats->rx_dropped          += dev_stats.rx_dropped;
+        stats->tx_dropped          += dev_stats.tx_dropped;
+        stats->multicast           += dev_stats.multicast;
+        stats->collisions          += dev_stats.collisions;
+        stats->rx_length_errors    += dev_stats.rx_length_errors;
+        stats->rx_over_errors      += dev_stats.rx_over_errors;
+        stats->rx_crc_errors       += dev_stats.rx_crc_errors;
+        stats->rx_frame_errors     += dev_stats.rx_frame_errors;
+        stats->rx_fifo_errors      += dev_stats.rx_fifo_errors;
+        stats->rx_missed_errors    += dev_stats.rx_missed_errors;
+        stats->tx_aborted_errors   += dev_stats.tx_aborted_errors;
+        stats->tx_carrier_errors   += dev_stats.tx_carrier_errors;
+        stats->tx_fifo_errors      += dev_stats.tx_fifo_errors;
+        stats->tx_heartbeat_errors += dev_stats.tx_heartbeat_errors;
+        stats->tx_window_errors    += dev_stats.tx_window_errors;
+
+        /* Account for the packets dropped in each xsk. */
+        for (i = 0; i < netdev_n_rxq(netdev); i++) {
+            xsk_info = dev->xsks[i];
+            if (xsk_info) {
+                stats->rx_dropped += xsk_info->rx_dropped;
+                stats->tx_dropped += xsk_info->tx_dropped;
+            }
+        }
+    }
+    ovs_mutex_unlock(&dev->mutex);
+
+    return error;
+}
diff --git a/lib/netdev-afxdp.h b/lib/netdev-afxdp.h
new file mode 100644
index 000000000000..dd2dc1a2064d
--- /dev/null
+++ b/lib/netdev-afxdp.h
@@ -0,0 +1,74 @@ 
+/*
+ * Copyright (c) 2018, 2019 Nicira, Inc.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef NETDEV_AFXDP_H
+#define NETDEV_AFXDP_H 1
+
+#include <config.h>
+
+#ifdef HAVE_AF_XDP
+
+#include <stdint.h>
+#include <stdbool.h>
+
+/* These functions are Linux AF_XDP specific, so they should be used directly
+ * only by Linux-specific code. */
+
+#define MAX_XSKQ 16
+
+struct netdev;
+struct xsk_socket_info;
+struct xdp_umem;
+struct dp_packet_batch;
+struct smap;
+struct dp_packet;
+struct netdev_rxq;
+struct netdev_stats;
+
+int netdev_afxdp_rxq_construct(struct netdev_rxq *rxq_);
+void netdev_afxdp_destruct(struct netdev *netdev_);
+
+int netdev_afxdp_rxq_recv(struct netdev_rxq *rxq_,
+                          struct dp_packet_batch *batch,
+                          int *qfill);
+int netdev_afxdp_batch_send(struct netdev *netdev_, int qid,
+                            struct dp_packet_batch *batch,
+                            bool concurrent_txq);
+int netdev_afxdp_set_config(struct netdev *netdev, const struct smap *args,
+                            char **errp);
+int netdev_afxdp_get_config(const struct netdev *netdev, struct smap *args);
+int netdev_afxdp_get_numa_id(const struct netdev *netdev);
+int netdev_afxdp_get_stats(const struct netdev *netdev_,
+                           struct netdev_stats *stats);
+
+void free_afxdp_buf(struct dp_packet *p);
+int netdev_afxdp_reconfigure(struct netdev *netdev);
+void signal_remove_xdp(struct netdev *netdev);
+
+#else /* !HAVE_AF_XDP */
+
+#include "openvswitch/compiler.h"
+
+struct dp_packet;
+
+static inline void
+free_afxdp_buf(struct dp_packet *p OVS_UNUSED)
+{
+    /* Nothing */
+}
+
+#endif /* HAVE_AF_XDP */
+#endif /* netdev-afxdp.h */
diff --git a/lib/netdev-linux-private.h b/lib/netdev-linux-private.h
new file mode 100644
index 000000000000..6b6768e7a240
--- /dev/null
+++ b/lib/netdev-linux-private.h
@@ -0,0 +1,138 @@ 
+/*
+ * Copyright (c) 2019 Nicira, Inc.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef NETDEV_LINUX_PRIVATE_H
+#define NETDEV_LINUX_PRIVATE_H 1
+
+#include <config.h>
+
+#include <linux/filter.h>
+#include <linux/gen_stats.h>
+#include <linux/if_ether.h>
+#include <linux/if_tun.h>
+#include <linux/types.h>
+#include <linux/ethtool.h>
+#include <linux/mii.h>
+#include <stdint.h>
+#include <stdbool.h>
+
+#include "netdev-afxdp.h"
+#include "netdev-provider.h"
+#include "netdev-vport.h"
+#include "openvswitch/thread.h"
+#include "ovs-atomic.h"
+#include "timer.h"
+#include "xdpsock.h"
+
+/* These functions are Linux specific, so they should be used directly only by
+ * Linux-specific code. */
+
+struct netdev;
+
+struct netdev_rxq_linux {
+    struct netdev_rxq up;
+    bool is_tap;
+    int fd;
+};
+
+void netdev_linux_run(const struct netdev_class *);
+
+int netdev_linux_ethtool_set_flag(struct netdev *netdev, uint32_t flag,
+                                  const char *flag_name, bool enable);
+
+int get_stats_via_netlink(const struct netdev *netdev_,
+                          struct netdev_stats *stats);
+
+struct netdev_linux {
+    struct netdev up;
+
+    /* Protects all members below. */
+    struct ovs_mutex mutex;
+
+    unsigned int cache_valid;
+
+    bool miimon;                    /* Link status of last poll. */
+    long long int miimon_interval;  /* Miimon Poll rate. Disabled if <= 0. */
+    struct timer miimon_timer;
+
+    int netnsid;                    /* Network namespace ID. */
+    /* The following are figured out "on demand" only.  They are only valid
+     * when the corresponding VALID_* bit in 'cache_valid' is set. */
+    int ifindex;
+    struct eth_addr etheraddr;
+    int mtu;
+    unsigned int ifi_flags;
+    long long int carrier_resets;
+    uint32_t kbits_rate;        /* Policing data. */
+    uint32_t kbits_burst;
+    int vport_stats_error;      /* Cached error code from vport_get_stats().
+                                   0 or an errno value. */
+    int netdev_mtu_error;       /* Cached error code from SIOCGIFMTU
+                                 * or SIOCSIFMTU.
+                                 */
+    int ether_addr_error;       /* Cached error code from set/get etheraddr. */
+    int netdev_policing_error;  /* Cached error code from set policing. */
+    int get_features_error;     /* Cached error code from ETHTOOL_GSET. */
+    int get_ifindex_error;      /* Cached error code from SIOCGIFINDEX. */
+
+    enum netdev_features current;    /* Cached from ETHTOOL_GSET. */
+    enum netdev_features advertised; /* Cached from ETHTOOL_GSET. */
+    enum netdev_features supported;  /* Cached from ETHTOOL_GSET. */
+
+    struct ethtool_drvinfo drvinfo;  /* Cached from ETHTOOL_GDRVINFO. */
+    struct tc *tc;
+
+    /* For devices of class netdev_tap_class only. */
+    int tap_fd;
+    bool present;               /* If the device is present in the namespace */
+    uint64_t tx_dropped;        /* tap device can drop if the iface is down */
+
+    /* LAG information. */
+    bool is_lag_master;         /* True if the netdev is a LAG master. */
+
+    /* AF_XDP information */
+#ifdef HAVE_AF_XDP
+    struct xsk_socket_info **xsks;
+    int requested_n_rxq;
+    int xdpmode, requested_xdpmode; /* To detect a mode change. */
+    int xdp_flags, xdp_bind_flags;
+    struct ovs_spinlock *tx_locks;
+#endif
+};
+
+static bool
+is_netdev_linux_class(const struct netdev_class *netdev_class)
+{
+    return netdev_class->run == netdev_linux_run;
+}
+
+static struct netdev_linux *
+netdev_linux_cast(const struct netdev *netdev)
+{
+    ovs_assert(is_netdev_linux_class(netdev_get_class(netdev)));
+
+    return CONTAINER_OF(netdev, struct netdev_linux, up);
+}
+
+static struct netdev_rxq_linux *
+netdev_rxq_linux_cast(const struct netdev_rxq *rx)
+{
+    ovs_assert(is_netdev_linux_class(netdev_get_class(rx->netdev)));
+
+    return CONTAINER_OF(rx, struct netdev_rxq_linux, up);
+}
+
+#endif /* netdev-linux-private.h */
diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
index e4ea94cf9243..2ba72e117989 100644
--- a/lib/netdev-linux.c
+++ b/lib/netdev-linux.c
@@ -17,6 +17,7 @@ 
 #include <config.h>
 
 #include "netdev-linux.h"
+#include "netdev-linux-private.h"
 
 #include <errno.h>
 #include <fcntl.h>
@@ -54,6 +55,7 @@ 
 #include "fatal-signal.h"
 #include "hash.h"
 #include "openvswitch/hmap.h"
+#include "netdev-afxdp.h"
 #include "netdev-provider.h"
 #include "netdev-vport.h"
 #include "netlink-notifier.h"
@@ -486,57 +488,6 @@  static int tc_calc_cell_log(unsigned int mtu);
 static void tc_fill_rate(struct tc_ratespec *rate, uint64_t bps, int mtu);
 static int tc_calc_buffer(unsigned int Bps, int mtu, uint64_t burst_bytes);
 
-struct netdev_linux {
-    struct netdev up;
-
-    /* Protects all members below. */
-    struct ovs_mutex mutex;
-
-    unsigned int cache_valid;
-
-    bool miimon;                    /* Link status of last poll. */
-    long long int miimon_interval;  /* Miimon Poll rate. Disabled if <= 0. */
-    struct timer miimon_timer;
-
-    int netnsid;                    /* Network namespace ID. */
-    /* The following are figured out "on demand" only.  They are only valid
-     * when the corresponding VALID_* bit in 'cache_valid' is set. */
-    int ifindex;
-    struct eth_addr etheraddr;
-    int mtu;
-    unsigned int ifi_flags;
-    long long int carrier_resets;
-    uint32_t kbits_rate;        /* Policing data. */
-    uint32_t kbits_burst;
-    int vport_stats_error;      /* Cached error code from vport_get_stats().
-                                   0 or an errno value. */
-    int netdev_mtu_error;       /* Cached error code from SIOCGIFMTU or SIOCSIFMTU. */
-    int ether_addr_error;       /* Cached error code from set/get etheraddr. */
-    int netdev_policing_error;  /* Cached error code from set policing. */
-    int get_features_error;     /* Cached error code from ETHTOOL_GSET. */
-    int get_ifindex_error;      /* Cached error code from SIOCGIFINDEX. */
-
-    enum netdev_features current;    /* Cached from ETHTOOL_GSET. */
-    enum netdev_features advertised; /* Cached from ETHTOOL_GSET. */
-    enum netdev_features supported;  /* Cached from ETHTOOL_GSET. */
-
-    struct ethtool_drvinfo drvinfo;  /* Cached from ETHTOOL_GDRVINFO. */
-    struct tc *tc;
-
-    /* For devices of class netdev_tap_class only. */
-    int tap_fd;
-    bool present;               /* If the device is present in the namespace */
-    uint64_t tx_dropped;        /* tap device can drop if the iface is down */
-
-    /* LAG information. */
-    bool is_lag_master;         /* True if the netdev is a LAG master. */
-};
-
-struct netdev_rxq_linux {
-    struct netdev_rxq up;
-    bool is_tap;
-    int fd;
-};
 
 /* This is set pretty low because we probably won't learn anything from the
  * additional log messages. */
@@ -550,8 +501,6 @@  static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
  * changes in the device miimon status, so we can use atomic_count. */
 static atomic_count miimon_cnt = ATOMIC_COUNT_INIT(0);
 
-static void netdev_linux_run(const struct netdev_class *);
-
 static int netdev_linux_do_ethtool(const char *name, struct ethtool_cmd *,
                                    int cmd, const char *cmd_name);
 static int get_flags(const struct netdev *, unsigned int *flags);
@@ -565,7 +514,6 @@  static int do_set_addr(struct netdev *netdev,
                        struct in_addr addr);
 static int get_etheraddr(const char *netdev_name, struct eth_addr *ea);
 static int set_etheraddr(const char *netdev_name, const struct eth_addr);
-static int get_stats_via_netlink(const struct netdev *, struct netdev_stats *);
 static int af_packet_sock(void);
 static bool netdev_linux_miimon_enabled(void);
 static void netdev_linux_miimon_run(void);
@@ -573,31 +521,10 @@  static void netdev_linux_miimon_wait(void);
 static int netdev_linux_get_mtu__(struct netdev_linux *netdev, int *mtup);
 
 static bool
-is_netdev_linux_class(const struct netdev_class *netdev_class)
-{
-    return netdev_class->run == netdev_linux_run;
-}
-
-static bool
 is_tap_netdev(const struct netdev *netdev)
 {
     return netdev_get_class(netdev) == &netdev_tap_class;
 }
-
-static struct netdev_linux *
-netdev_linux_cast(const struct netdev *netdev)
-{
-    ovs_assert(is_netdev_linux_class(netdev_get_class(netdev)));
-
-    return CONTAINER_OF(netdev, struct netdev_linux, up);
-}
-
-static struct netdev_rxq_linux *
-netdev_rxq_linux_cast(const struct netdev_rxq *rx)
-{
-    ovs_assert(is_netdev_linux_class(netdev_get_class(rx->netdev)));
-    return CONTAINER_OF(rx, struct netdev_rxq_linux, up);
-}
 
 static int
 netdev_linux_netnsid_update__(struct netdev_linux *netdev)
@@ -773,7 +700,7 @@  netdev_linux_update_lag(struct rtnetlink_change *change)
     }
 }
 
-static void
+void
 netdev_linux_run(const struct netdev_class *netdev_class OVS_UNUSED)
 {
     struct nl_sock *sock;
@@ -3278,9 +3205,7 @@  exit:
     .run = netdev_linux_run,                                    \
     .wait = netdev_linux_wait,                                  \
     .alloc = netdev_linux_alloc,                                \
-    .destruct = netdev_linux_destruct,                          \
     .dealloc = netdev_linux_dealloc,                            \
-    .send = netdev_linux_send,                                  \
     .send_wait = netdev_linux_send_wait,                        \
     .set_etheraddr = netdev_linux_set_etheraddr,                \
     .get_etheraddr = netdev_linux_get_etheraddr,                \
@@ -3311,39 +3236,71 @@  exit:
     .arp_lookup = netdev_linux_arp_lookup,                      \
     .update_flags = netdev_linux_update_flags,                  \
     .rxq_alloc = netdev_linux_rxq_alloc,                        \
-    .rxq_construct = netdev_linux_rxq_construct,                \
     .rxq_destruct = netdev_linux_rxq_destruct,                  \
     .rxq_dealloc = netdev_linux_rxq_dealloc,                    \
-    .rxq_recv = netdev_linux_rxq_recv,                          \
     .rxq_wait = netdev_linux_rxq_wait,                          \
     .rxq_drain = netdev_linux_rxq_drain
 
 const struct netdev_class netdev_linux_class = {
     NETDEV_LINUX_CLASS_COMMON,
     .type = "system",
+    .is_pmd = false,
     .construct = netdev_linux_construct,
+    .destruct = netdev_linux_destruct,
     .get_stats = netdev_linux_get_stats,
     .get_features = netdev_linux_get_features,
     .get_status = netdev_linux_get_status,
-    .get_block_id = netdev_linux_get_block_id
+    .get_block_id = netdev_linux_get_block_id,
+    .send = netdev_linux_send,
+    .rxq_construct = netdev_linux_rxq_construct,
+    .rxq_recv = netdev_linux_rxq_recv,
 };
 
 const struct netdev_class netdev_tap_class = {
     NETDEV_LINUX_CLASS_COMMON,
     .type = "tap",
+    .is_pmd = false,
     .construct = netdev_linux_construct_tap,
+    .destruct = netdev_linux_destruct,
     .get_stats = netdev_tap_get_stats,
     .get_features = netdev_linux_get_features,
     .get_status = netdev_linux_get_status,
+    .send = netdev_linux_send,
+    .rxq_construct = netdev_linux_rxq_construct,
+    .rxq_recv = netdev_linux_rxq_recv,
 };
 
 const struct netdev_class netdev_internal_class = {
     NETDEV_LINUX_CLASS_COMMON,
     .type = "internal",
+    .is_pmd = false,
     .construct = netdev_linux_construct,
+    .destruct = netdev_linux_destruct,
     .get_stats = netdev_internal_get_stats,
     .get_status = netdev_internal_get_status,
+    .send = netdev_linux_send,
+    .rxq_construct = netdev_linux_rxq_construct,
+    .rxq_recv = netdev_linux_rxq_recv,
 };
+
+#ifdef HAVE_AF_XDP
+const struct netdev_class netdev_afxdp_class = {
+    NETDEV_LINUX_CLASS_COMMON,
+    .type = "afxdp",
+    .is_pmd = true,
+    .construct = netdev_linux_construct,
+    .destruct = netdev_afxdp_destruct,
+    .get_stats = netdev_afxdp_get_stats,
+    .get_status = netdev_linux_get_status,
+    .set_config = netdev_afxdp_set_config,
+    .get_config = netdev_afxdp_get_config,
+    .reconfigure = netdev_afxdp_reconfigure,
+    .get_numa_id = netdev_afxdp_get_numa_id,
+    .send = netdev_afxdp_batch_send,
+    .rxq_construct = netdev_afxdp_rxq_construct,
+    .rxq_recv = netdev_afxdp_rxq_recv,
+};
+#endif
 
 
 #define CODEL_N_QUEUES 0x0000
@@ -5915,7 +5872,7 @@  netdev_stats_from_rtnl_link_stats64(struct netdev_stats *dst,
     dst->tx_window_errors = src->tx_window_errors;
 }
 
-static int
+int
 get_stats_via_netlink(const struct netdev *netdev_, struct netdev_stats *stats)
 {
     struct ofpbuf request;
diff --git a/lib/netdev-provider.h b/lib/netdev-provider.h
index b2e7078886c7..4986c05ed9d6 100644
--- a/lib/netdev-provider.h
+++ b/lib/netdev-provider.h
@@ -825,6 +825,9 @@  extern const struct netdev_class netdev_linux_class;
 extern const struct netdev_class netdev_internal_class;
 extern const struct netdev_class netdev_tap_class;
 
+#ifdef HAVE_AF_XDP
+extern const struct netdev_class netdev_afxdp_class;
+#endif
 #ifdef  __cplusplus
 }
 #endif
diff --git a/lib/netdev.c b/lib/netdev.c
index 96587996f636..f80bd5bd9e5f 100644
--- a/lib/netdev.c
+++ b/lib/netdev.c
@@ -103,6 +103,9 @@  static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
 
 static void restore_all_flags(void *aux OVS_UNUSED);
 void update_device_args(struct netdev *, const struct shash *args);
+#ifdef HAVE_AF_XDP
+void signal_remove_xdp(struct netdev *netdev);
+#endif
 
 int
 netdev_n_txq(const struct netdev *netdev)
@@ -147,6 +150,9 @@  netdev_initialize(void)
         netdev_vport_tunnel_register();
 
         netdev_register_flow_api_provider(&netdev_offload_tc);
+#ifdef HAVE_AF_XDP
+        netdev_register_provider(&netdev_afxdp_class);
+#endif
 #endif
 #if defined(__FreeBSD__) || defined(__NetBSD__)
         netdev_register_provider(&netdev_tap_class);
@@ -2011,6 +2017,11 @@  restore_all_flags(void *aux OVS_UNUSED)
                                                saved_flags & ~saved_values,
                                                &old_flags);
         }
+#ifdef HAVE_AF_XDP
+        if (netdev->netdev_class == &netdev_afxdp_class) {
+            signal_remove_xdp(netdev);
+        }
+#endif
     }
 }
 
diff --git a/lib/spinlock.h b/lib/spinlock.h
new file mode 100644
index 000000000000..1ae634f23a6b
--- /dev/null
+++ b/lib/spinlock.h
@@ -0,0 +1,70 @@ 
+/*
+ * Copyright (c) 2018, 2019 Nicira, Inc.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+#ifndef SPINLOCK_H
+#define SPINLOCK_H 1
+
+#include <config.h>
+
+#include <ctype.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <stdarg.h>
+#include <stdlib.h>
+#include <unistd.h>
+
+#include "ovs-atomic.h"
+
+struct ovs_spinlock {
+    OVS_ALIGNED_VAR(CACHE_LINE_SIZE) atomic_int locked;
+};
+
+static inline void
+ovs_spinlock_init(struct ovs_spinlock *sl)
+{
+    atomic_init(&sl->locked, 0);
+}
+
+static inline void
+ovs_spin_lock(struct ovs_spinlock *sl)
+{
+    int exp = 0, locked = 0;
+
+    while (!atomic_compare_exchange_strong_explicit(&sl->locked, &exp, 1,
+                memory_order_acquire,
+                memory_order_relaxed)) {
+        locked = 1;
+        while (locked) {
+            atomic_read_relaxed(&sl->locked, &locked);
+        }
+        exp = 0;
+    }
+}
+
+static inline void
+ovs_spin_unlock(struct ovs_spinlock *sl)
+{
+    atomic_store_explicit(&sl->locked, 0, memory_order_release);
+}
+
+static inline int
+ovs_spin_trylock(struct ovs_spinlock *sl)
+{
+    int exp = 0;
+    return atomic_compare_exchange_strong_explicit(&sl->locked, &exp, 1,
+                memory_order_acquire,
+                memory_order_relaxed);
+}
+#endif
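Review note on lib/spinlock.h: a minimal, self-contained sketch of the same test-and-test-and-set scheme, written against plain C11 <stdatomic.h> instead of ovs-atomic.h. The sketch_* names are hypothetical and are not part of this patch. As in the patch, trylock reports success with a nonzero value, which is the opposite convention of pthread_mutex_trylock().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct ovs_spinlock (no cache-line alignment). */
struct sketch_spinlock {
    atomic_int locked;      /* 0 = free, 1 = held. */
};

static inline void
sketch_spin_init(struct sketch_spinlock *sl)
{
    atomic_init(&sl->locked, 0);
}

/* Returns true if the lock was taken, false if it was already held. */
static inline bool
sketch_spin_trylock(struct sketch_spinlock *sl)
{
    int expected = 0;

    return atomic_compare_exchange_strong_explicit(&sl->locked, &expected, 1,
                                                   memory_order_acquire,
                                                   memory_order_relaxed);
}

static inline void
sketch_spin_lock(struct sketch_spinlock *sl)
{
    while (!sketch_spin_trylock(sl)) {
        /* Spin on a relaxed load so that waiters do not keep issuing
         * compare-and-swap operations while the lock is held. */
        while (atomic_load_explicit(&sl->locked, memory_order_relaxed)) {
            continue;
        }
    }
}

static inline void
sketch_spin_unlock(struct sketch_spinlock *sl)
{
    atomic_store_explicit(&sl->locked, 0, memory_order_release);
}
```

The acquire ordering on a successful exchange pairs with the release store in unlock, so writes made inside the critical section are visible to the next holder.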
diff --git a/lib/util.c b/lib/util.c
index 7b8ab81f6ee1..5eb20995b370 100644
--- a/lib/util.c
+++ b/lib/util.c
@@ -214,20 +214,19 @@  x2nrealloc(void *p, size_t *n, size_t s)
     return xrealloc(p, *n * s);
 }
 
-/* Allocates and returns 'size' bytes of memory aligned to a cache line and in
- * dedicated cache lines.  That is, the memory block returned will not share a
- * cache line with other data, avoiding "false sharing".
+/* Allocates and returns 'size' bytes of memory aligned to 'alignment' bytes.
+ * 'alignment' must be a power of two and a multiple of sizeof(void *).
  *
- * Use free_cacheline() to free the returned memory block. */
+ * Use free_size_align() to free the returned memory block. */
 void *
-xmalloc_cacheline(size_t size)
+xmalloc_size_align(size_t size, size_t alignment)
 {
 #ifdef HAVE_POSIX_MEMALIGN
     void *p;
     int error;
 
     COVERAGE_INC(util_xalloc);
-    error = posix_memalign(&p, CACHE_LINE_SIZE, size ? size : 1);
+    error = posix_memalign(&p, alignment, size ? size : 1);
     if (error != 0) {
         out_of_memory();
     }
@@ -235,16 +234,16 @@  xmalloc_cacheline(size_t size)
 #else
     /* Allocate room for:
      *
-     *     - Header padding: Up to CACHE_LINE_SIZE - 1 bytes, to allow the
-     *       pointer to be aligned exactly sizeof(void *) bytes before the
-     *       beginning of a cache line.
+     *     - Header padding: Up to alignment - 1 bytes, to allow the
+     *       pointer 'q' to be placed exactly sizeof(void *) bytes before the
+     *       next 'alignment' boundary.
      *
      *     - Pointer: A pointer to the start of the header padding, to allow us
      *       to free() the block later.
      *
      *     - User data: 'size' bytes.
      *
-     *     - Trailer padding: Enough to bring the user data up to a cache line
+     *     - Trailer padding: Enough to bring the user data up to an alignment
      *       multiple.
      *
      * +---------------+---------+------------------------+---------+
@@ -255,18 +254,56 @@  xmalloc_cacheline(size_t size)
      * p               q         r
      *
      */
-    void *p = xmalloc((CACHE_LINE_SIZE - 1)
-                      + sizeof(void *)
-                      + ROUND_UP(size, CACHE_LINE_SIZE));
-    bool runt = PAD_SIZE((uintptr_t) p, CACHE_LINE_SIZE) < sizeof(void *);
-    void *r = (void *) ROUND_UP((uintptr_t) p + (runt ? CACHE_LINE_SIZE : 0),
-                                CACHE_LINE_SIZE);
-    void **q = (void **) r - 1;
+    void *p, *r, **q;
+    bool runt;
+
+    COVERAGE_INC(util_xalloc);
+    if (!IS_POW2(alignment) || (alignment % sizeof(void *) != 0)) {
+        ovs_abort(0, "Invalid alignment");
+    }
+
+    p = xmalloc((alignment - 1)
+                + sizeof(void *)
+                + ROUND_UP(size, alignment));
+
+    runt = PAD_SIZE((uintptr_t) p, alignment) < sizeof(void *);
+    /* When the padding size < sizeof(void *), there is not enough room for
+     * pointer 'q'.  As a result, 'r' must move to the next alignment
+     * boundary: the xmalloc() above reserves the extra space, and the
+     * ROUND_UP() below advances 'r' by one more 'alignment'.
+     */
+    r = (void *) ROUND_UP((uintptr_t) p + (runt ? alignment : 0), alignment);
+    q = (void **) r - 1;
     *q = p;
+
     return r;
 #endif
 }
 
+void
+free_size_align(void *p)
+{
+#ifdef HAVE_POSIX_MEMALIGN
+    free(p);
+#else
+    if (p) {
+        void **q = (void **) p - 1;
+        free(*q);
+    }
+#endif
+}
+
+/* Allocates and returns 'size' bytes of memory aligned to a cache line and in
+ * dedicated cache lines.  That is, the memory block returned will not share a
+ * cache line with other data, avoiding "false sharing".
+ *
+ * Use free_cacheline() to free the returned memory block. */
+void *
+xmalloc_cacheline(size_t size)
+{
+    return xmalloc_size_align(size, CACHE_LINE_SIZE);
+}
+
 /* Like xmalloc_cacheline() but clears the allocated memory to all zero
  * bytes. */
 void *
@@ -282,14 +319,19 @@  xzalloc_cacheline(size_t size)
 void
 free_cacheline(void *p)
 {
-#ifdef HAVE_POSIX_MEMALIGN
-    free(p);
-#else
-    if (p) {
-        void **q = (void **) p - 1;
-        free(*q);
-    }
-#endif
+    free_size_align(p);
+}
+
+void *
+xmalloc_pagealign(size_t size)
+{
+    return xmalloc_size_align(size, get_page_size());
+}
+
+void
+free_pagealign(void *p)
+{
+    free_size_align(p);
 }
 
 char *
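Review note on the non-posix_memalign() path of xmalloc_size_align(): the p/q/r layout described in the comment can be sketched stand-alone as follows. This is only an illustration under stated assumptions: the sketch_* names are hypothetical, plain malloc() replaces xmalloc(), and invalid alignments return NULL where the patch calls ovs_abort().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the fallback path; returns NULL on bad input. */
static void *
sketch_malloc_aligned(size_t size, size_t alignment)
{
    /* 'alignment' must be a power of two and a multiple of sizeof(void *). */
    if (!alignment || (alignment & (alignment - 1))
        || alignment % sizeof(void *)) {
        return NULL;
    }

    /* Header padding + back pointer + user data rounded up to 'alignment'. */
    size_t rounded = (size + alignment - 1) & ~(alignment - 1);
    void *p = malloc((alignment - 1) + sizeof(void *) + rounded);
    if (!p) {
        return NULL;
    }

    uintptr_t base = (uintptr_t) p;
    /* Bytes from 'p' up to the next boundary (0 if 'p' is already aligned). */
    size_t pad = (alignment - base % alignment) % alignment;
    /* Too little room for the back pointer: skip one whole 'alignment'. */
    bool runt = pad < sizeof(void *);
    uintptr_t start = base + (runt ? alignment : 0);
    void *r = (void *) ((start + alignment - 1)
                        & ~(uintptr_t) (alignment - 1));
    void **q = (void **) r - 1;

    *q = p;     /* Remember what to hand back to free(). */
    return r;
}

static void
sketch_free_aligned(void *r)
{
    if (r) {
        void **q = (void **) r - 1;

        free(*q);
    }
}
```

In the runt case 'r' sits at most alignment + sizeof(void *) - 1 bytes past 'p', which is exactly the header space reserved in the malloc() call, so the rounded user data still fits.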
diff --git a/lib/util.h b/lib/util.h
index 095ede20f07f..7ad8758fe637 100644
--- a/lib/util.h
+++ b/lib/util.h
@@ -169,6 +169,11 @@  void ovs_strzcpy(char *dst, const char *src, size_t size);
 
 int string_ends_with(const char *str, const char *suffix);
 
+void *xmalloc_pagealign(size_t) MALLOC_LIKE;
+void free_pagealign(void *);
+void *xmalloc_size_align(size_t, size_t) MALLOC_LIKE;
+void free_size_align(void *);
+
 /* The C standards say that neither the 'dst' nor 'src' argument to
  * memcpy() may be null, even if 'n' is zero.  This wrapper tolerates
  * the null case. */
diff --git a/lib/xdpsock.c b/lib/xdpsock.c
new file mode 100644
index 000000000000..ea39fa557290
--- /dev/null
+++ b/lib/xdpsock.c
@@ -0,0 +1,170 @@ 
+/*
+ * Copyright (c) 2018, 2019 Nicira, Inc.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+#include <config.h>
+
+#include "xdpsock.h"
+#include "dp-packet.h"
+#include "openvswitch/compiler.h"
+
+/* Note:
+ * umem_elem_push* shouldn't overflow because we always pop
+ * elem first, then push back to the stack.
+ */
+static inline void
+__umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs)
+{
+    void *ptr;
+
+    if (OVS_UNLIKELY(umemp->index + n > umemp->size)) {
+        OVS_NOT_REACHED();
+    }
+
+    ptr = &umemp->array[umemp->index];
+    memcpy(ptr, addrs, n * sizeof(void *));
+    umemp->index += n;
+}
+
+void umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs)
+{
+    ovs_spin_lock(&umemp->lock);
+    __umem_elem_push_n(umemp, n, addrs);
+    ovs_spin_unlock(&umemp->lock);
+}
+
+static inline void
+__umem_elem_push(struct umem_pool *umemp, void *addr)
+{
+    if (OVS_UNLIKELY(umemp->index + 1 > umemp->size)) {
+        OVS_NOT_REACHED();
+    }
+
+    umemp->array[umemp->index++] = addr;
+}
+
+void
+umem_elem_push(struct umem_pool *umemp, void *addr)
+{
+
+    ovs_assert(((uint64_t)addr & FRAME_SHIFT_MASK) == 0);
+
+    ovs_spin_lock(&umemp->lock);
+    __umem_elem_push(umemp, addr);
+    ovs_spin_unlock(&umemp->lock);
+}
+
+static inline int
+__umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs)
+{
+    void *ptr;
+
+    if (OVS_UNLIKELY(umemp->index - n < 0)) {
+        return -ENOMEM;
+    }
+
+    umemp->index -= n;
+    ptr = &umemp->array[umemp->index];
+    memcpy(addrs, ptr, n * sizeof(void *));
+
+    return 0;
+}
+
+int
+umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs)
+{
+    int ret;
+
+    ovs_spin_lock(&umemp->lock);
+    ret = __umem_elem_pop_n(umemp, n, addrs);
+    ovs_spin_unlock(&umemp->lock);
+
+    return ret;
+}
+
+static inline void *
+__umem_elem_pop(struct umem_pool *umemp)
+{
+    if (OVS_UNLIKELY(umemp->index - 1 < 0)) {
+        return NULL;
+    }
+
+    return umemp->array[--umemp->index];
+}
+
+void *
+umem_elem_pop(struct umem_pool *umemp)
+{
+    void *ptr;
+
+    ovs_spin_lock(&umemp->lock);
+    ptr = __umem_elem_pop(umemp);
+    ovs_spin_unlock(&umemp->lock);
+
+    return ptr;
+}
+
+static void **
+__umem_pool_alloc(unsigned int size)
+{
+    void *bufs;
+
+    bufs = xmalloc_pagealign(size * sizeof(void *));
+    memset(bufs, 0, size * sizeof(void *));
+
+    return (void **)bufs;
+}
+
+int
+umem_pool_init(struct umem_pool *umemp, unsigned int size)
+{
+    umemp->array = __umem_pool_alloc(size);
+    if (!umemp->array) {
+        return -ENOMEM;
+    }
+
+    umemp->size = size;
+    umemp->index = 0;
+    ovs_spinlock_init(&umemp->lock);
+    return 0;
+}
+
+void
+umem_pool_cleanup(struct umem_pool *umemp)
+{
+    free_pagealign(umemp->array);
+    umemp->array = NULL;
+}
+
+/* AF_XDP metadata init/destroy */
+int
+xpacket_pool_init(struct xpacket_pool *xp, unsigned int size)
+{
+    void *bufs;
+
+    bufs = xmalloc_pagealign(size * sizeof(struct dp_packet_afxdp));
+    memset(bufs, 0, size * sizeof(struct dp_packet_afxdp));
+
+    xp->array = bufs;
+    xp->size = size;
+
+    return 0;
+}
+
+void
+xpacket_pool_cleanup(struct xpacket_pool *xp)
+{
+    free_pagealign(xp->array);
+    xp->array = NULL;
+}
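Review note on lib/xdpsock.c: the umem pool is a LIFO array of pointers, and the batch push/pop bookkeeping can be sketched without the spinlock as follows. The sketch_pool_* names are hypothetical, calloc() replaces xmalloc_pagealign(), and overflow returns -ENOSPC here where the patch hits OVS_NOT_REACHED().

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, lock-free stand-in for struct umem_pool. */
struct sketch_pool {
    int index;              /* Number of elements currently stored. */
    unsigned int size;      /* Capacity of 'array'. */
    void **array;           /* Stack of buffer addresses. */
};

static int
sketch_pool_init(struct sketch_pool *pool, unsigned int size)
{
    pool->array = calloc(size, sizeof *pool->array);
    if (!pool->array) {
        return -ENOMEM;
    }
    pool->size = size;
    pool->index = 0;
    return 0;
}

/* Pushes 'n' addresses on top of the stack, or fails if they do not fit. */
static int
sketch_pool_push_n(struct sketch_pool *pool, int n, void **addrs)
{
    if (n < 0 || (unsigned int) pool->index + n > pool->size) {
        return -ENOSPC;
    }
    memcpy(&pool->array[pool->index], addrs, n * sizeof(void *));
    pool->index += n;
    return 0;
}

/* Pops the top 'n' addresses into 'addrs', or fails if fewer are stored. */
static int
sketch_pool_pop_n(struct sketch_pool *pool, int n, void **addrs)
{
    if (n < 0 || pool->index - n < 0) {
        return -ENOMEM;
    }
    pool->index -= n;
    memcpy(addrs, &pool->array[pool->index], n * sizeof(void *));
    return 0;
}
```

A batch pop returns the topmost 'n' entries in the order they were pushed, so popping two elements after pushing a, b, c yields b, c.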
diff --git a/lib/xdpsock.h b/lib/xdpsock.h
new file mode 100644
index 000000000000..1a1093381243
--- /dev/null
+++ b/lib/xdpsock.h
@@ -0,0 +1,101 @@ 
+/*
+ * Copyright (c) 2018, 2019 Nicira, Inc.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef XDPSOCK_H
+#define XDPSOCK_H 1
+
+#include <config.h>
+
+#ifdef HAVE_AF_XDP
+
+#include <bpf/xsk.h>
+#include <errno.h>
+#include <stdbool.h>
+#include <stdio.h>
+
+#include "openvswitch/thread.h"
+#include "ovs-atomic.h"
+#include "spinlock.h"
+
+#define FRAME_HEADROOM  XDP_PACKET_HEADROOM
+#define FRAME_SIZE      XSK_UMEM__DEFAULT_FRAME_SIZE
+#define FRAME_SHIFT     XSK_UMEM__DEFAULT_FRAME_SHIFT
+#define FRAME_SHIFT_MASK    ((1 << FRAME_SHIFT) - 1)
+
+#define PROD_NUM_DESCS XSK_RING_PROD__DEFAULT_NUM_DESCS
+#define CONS_NUM_DESCS XSK_RING_CONS__DEFAULT_NUM_DESCS
+
+/* The worst case is that all 4 queues (TX/CQ/RX/FILL) are full.
+ * Setting NUM_FRAMES to this makes sure umem_pop always succeeds.
+ */
+#define NUM_FRAMES      (2 * (PROD_NUM_DESCS + CONS_NUM_DESCS))
+
+#define BATCH_SIZE      NETDEV_MAX_BURST
+
+BUILD_ASSERT_DECL(IS_POW2(NUM_FRAMES));
+BUILD_ASSERT_DECL(PROD_NUM_DESCS == CONS_NUM_DESCS);
+BUILD_ASSERT_DECL(NUM_FRAMES == 2 * (PROD_NUM_DESCS + CONS_NUM_DESCS));
+
+/* LIFO ptr_array */
+struct umem_pool {
+    int index;      /* point to top */
+    unsigned int size;
+    struct ovs_spinlock lock;
+    void **array;   /* a pointer array, point to umem buf */
+};
+
+/* array-based dp_packet_afxdp */
+struct xpacket_pool {
+    unsigned int size;
+    struct dp_packet_afxdp **array;
+};
+
+struct xsk_umem_info {
+    struct umem_pool mpool;
+    struct xpacket_pool xpool;
+    struct xsk_ring_prod fq;
+    struct xsk_ring_cons cq;
+    struct xsk_umem *umem;
+    void *buffer;
+};
+
+struct xsk_socket_info {
+    struct xsk_ring_cons rx;
+    struct xsk_ring_prod tx;
+    struct xsk_umem_info *umem;
+    struct xsk_socket *xsk;
+    unsigned long rx_dropped;
+    unsigned long tx_dropped;
+    uint32_t outstanding_tx;
+};
+
+struct umem_elem {
+    struct umem_elem *next;
+};
+
+void umem_elem_push(struct umem_pool *umemp, void *addr);
+void umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs);
+
+void *umem_elem_pop(struct umem_pool *umemp);
+int umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs);
+
+int umem_pool_init(struct umem_pool *umemp, unsigned int size);
+void umem_pool_cleanup(struct umem_pool *umemp);
+int xpacket_pool_init(struct xpacket_pool *xp, unsigned int size);
+void xpacket_pool_cleanup(struct xpacket_pool *xp);
+
+#endif
+#endif
diff --git a/tests/automake.mk b/tests/automake.mk
index 2956e68b242c..131564bb0bd3 100644
--- a/tests/automake.mk
+++ b/tests/automake.mk
@@ -4,12 +4,14 @@  EXTRA_DIST += \
 	$(SYSTEM_TESTSUITE_AT) \
 	$(SYSTEM_KMOD_TESTSUITE_AT) \
 	$(SYSTEM_USERSPACE_TESTSUITE_AT) \
+	$(SYSTEM_AFXDP_TESTSUITE_AT) \
 	$(SYSTEM_OFFLOADS_TESTSUITE_AT) \
 	$(SYSTEM_DPDK_TESTSUITE_AT) \
 	$(OVSDB_CLUSTER_TESTSUITE_AT) \
 	$(TESTSUITE) \
 	$(SYSTEM_KMOD_TESTSUITE) \
 	$(SYSTEM_USERSPACE_TESTSUITE) \
+	$(SYSTEM_AFXDP_TESTSUITE) \
 	$(SYSTEM_OFFLOADS_TESTSUITE) \
 	$(SYSTEM_DPDK_TESTSUITE) \
 	$(OVSDB_CLUSTER_TESTSUITE) \
@@ -160,6 +162,10 @@  SYSTEM_USERSPACE_TESTSUITE_AT = \
 	tests/system-userspace-macros.at \
 	tests/system-userspace-packet-type-aware.at
 
+SYSTEM_AFXDP_TESTSUITE_AT = \
+	tests/system-afxdp-testsuite.at \
+	tests/system-afxdp-macros.at
+
 SYSTEM_TESTSUITE_AT = \
 	tests/system-common-macros.at \
 	tests/system-ovn.at \
@@ -184,6 +190,7 @@  TESTSUITE = $(srcdir)/tests/testsuite
 TESTSUITE_PATCH = $(srcdir)/tests/testsuite.patch
 SYSTEM_KMOD_TESTSUITE = $(srcdir)/tests/system-kmod-testsuite
 SYSTEM_USERSPACE_TESTSUITE = $(srcdir)/tests/system-userspace-testsuite
+SYSTEM_AFXDP_TESTSUITE = $(srcdir)/tests/system-afxdp-testsuite
 SYSTEM_OFFLOADS_TESTSUITE = $(srcdir)/tests/system-offloads-testsuite
 SYSTEM_DPDK_TESTSUITE = $(srcdir)/tests/system-dpdk-testsuite
 OVSDB_CLUSTER_TESTSUITE = $(srcdir)/tests/ovsdb-cluster-testsuite
@@ -317,6 +324,11 @@  check-system-userspace: all
 	set $(SHELL) '$(SYSTEM_USERSPACE_TESTSUITE)' -C tests  AUTOTEST_PATH='$(AUTOTEST_PATH)'; \
 	"$$@" $(TESTSUITEFLAGS) -j1 || (test X'$(RECHECK)' = Xyes && "$$@" --recheck)
 
+check-afxdp: all
+	$(MAKE) install
+	set $(SHELL) '$(SYSTEM_AFXDP_TESTSUITE)' -C tests  AUTOTEST_PATH='$(AUTOTEST_PATH)' $(TESTSUITEFLAGS) -j1; \
+	"$$@" || (test X'$(RECHECK)' = Xyes && "$$@" --recheck)
+
 check-offloads: all
 	set $(SHELL) '$(SYSTEM_OFFLOADS_TESTSUITE)' -C tests  AUTOTEST_PATH='$(AUTOTEST_PATH)'; \
 	"$$@" $(TESTSUITEFLAGS) -j1 || (test X'$(RECHECK)' = Xyes && "$$@" --recheck)
@@ -354,6 +366,10 @@  $(SYSTEM_USERSPACE_TESTSUITE): package.m4 $(SYSTEM_TESTSUITE_AT) $(SYSTEM_USERSP
 	$(AM_V_GEN)$(AUTOTEST) -I '$(srcdir)' -o $@.tmp $@.at
 	$(AM_V_at)mv $@.tmp $@
 
+$(SYSTEM_AFXDP_TESTSUITE): package.m4 $(SYSTEM_TESTSUITE_AT) $(SYSTEM_AFXDP_TESTSUITE_AT) $(COMMON_MACROS_AT)
+	$(AM_V_GEN)$(AUTOTEST) -I '$(srcdir)' -o $@.tmp $@.at
+	$(AM_V_at)mv $@.tmp $@
+
 $(SYSTEM_OFFLOADS_TESTSUITE): package.m4 $(SYSTEM_TESTSUITE_AT) $(SYSTEM_OFFLOADS_TESTSUITE_AT) $(COMMON_MACROS_AT)
 	$(AM_V_GEN)$(AUTOTEST) -I '$(srcdir)' -o $@.tmp $@.at
 	$(AM_V_at)mv $@.tmp $@
diff --git a/tests/system-afxdp-macros.at b/tests/system-afxdp-macros.at
new file mode 100644
index 000000000000..1e6f7a46b4b7
--- /dev/null
+++ b/tests/system-afxdp-macros.at
@@ -0,0 +1,20 @@ 
+# Add port to ovs bridge by using afxdp mode.
+# This will use generic XDP support in the veth driver.
+m4_define([ADD_VETH],
+    [ AT_CHECK([ip link add $1 type veth peer name ovs-$1 || return 77])
+      CONFIGURE_VETH_OFFLOADS([$1])
+      AT_CHECK([ip link set $1 netns $2])
+      AT_CHECK([ip link set dev ovs-$1 up])
+      AT_CHECK([ovs-vsctl add-port $3 ovs-$1 -- \
+                set interface ovs-$1 external-ids:iface-id="$1" type="afxdp"])
+      NS_CHECK_EXEC([$2], [ip addr add $4 dev $1 $7])
+      NS_CHECK_EXEC([$2], [ip link set dev $1 up])
+      if test -n "$5"; then
+        NS_CHECK_EXEC([$2], [ip link set dev $1 address $5])
+      fi
+      if test -n "$6"; then
+        NS_CHECK_EXEC([$2], [ip route add default via $6])
+      fi
+      on_exit 'ip link del ovs-$1'
+    ]
+)
diff --git a/tests/system-afxdp-testsuite.at b/tests/system-afxdp-testsuite.at
new file mode 100644
index 000000000000..9b7a29066614
--- /dev/null
+++ b/tests/system-afxdp-testsuite.at
@@ -0,0 +1,26 @@ 
+AT_INIT
+
+AT_COPYRIGHT([Copyright (c) 2018, 2019 Nicira, Inc.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at:
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.])
+
+m4_ifdef([AT_COLOR_TESTS], [AT_COLOR_TESTS])
+
+m4_include([tests/ovs-macros.at])
+m4_include([tests/ovsdb-macros.at])
+m4_include([tests/ofproto-macros.at])
+m4_include([tests/system-common-macros.at])
+m4_include([tests/system-userspace-macros.at])
+m4_include([tests/system-afxdp-macros.at])
+
+m4_include([tests/system-traffic.at])
diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
index bf4b6f8dc621..1f020e1c3825 100644
--- a/vswitchd/vswitch.xml
+++ b/vswitchd/vswitch.xml
@@ -3106,6 +3106,21 @@  ovs-vsctl add-port br0 p0 -- set Interface p0 type=patch options:peer=p1 \
         </p>
       </column>
 
+      <column name="other_config" key="xdpmode"
+              type='{"type": "string",
+                     "enum": ["set", ["skb", "drv"]]}'>
+        <p>
+          Specifies the operational mode of the XDP program.
+          If "drv", the XDP program is loaded into the device driver with
+          zero-copy RX and TX enabled. This mode requires a device driver
+          with AF_XDP support and gives the best performance.
+          If "skb", the XDP program runs in the kernel's generic XDP mode,
+          with extra data copying between userspace and the kernel. No
+          device driver support is needed. This option applies only to the
+          afxdp netdev type. Defaults to "skb" mode.
+        </p>
+      </column>
+
       <column name="options" key="vhost-server-path"
               type='{"type": "string"}'>
         <p>