[P4] any performance data on your P4ELTE/p4c compiler for DPDK ?

Tue Aug 2 15:55:27 CEST 2016

Hi Luigi,

Sorry for the delay; I was out of office for a few days.

First of all, I have to point out that our compiler only reuses the P4 
front-end called HLIR; the back-end has been fully reimplemented. In 
our case the idea was to support multiple architectures, resulting in a 
core compiler that relies on a Hardware Abstraction Library (HAL) to be 
implemented for each target. The HAL for DPDK is released with our 
compiler (see github), and one for the Freescale NPU is under development.

The first measurements with Intel NICs show 13.04 and 10.10 MPPS for the 
L2 and L3 examples on a single-core setup with few entries in the tables. 
For L2, we also have some preliminary scalability measurements in 
another setup with 200 entries in the smac and dmac tables: 17.5 MPPS 
with 2 cores, 27.8 MPPS with 4 cores, and 33.4 MPPS with 16 cores. (This 
seems to be the hardware bottleneck of the Mellanox ConnectX-4; it was 
actually measured only this morning.) More comprehensive experiments 
will be available soon.
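
As a rough reading aid for the scalability figures above, the sketch 
below divides each total by its core count and compares it with the 
2-core run. Only the three L2 numbers from this mail are used; the 
function names are made up for illustration, and the single-core 
13.04 MPPS figure is left out because it comes from a different setup.

```python
# L2 throughput figures quoted above (200 entries in smac and dmac).
L2_MPPS_BY_CORES = {2: 17.5, 4: 27.8, 16: 33.4}


def per_core_mpps(total_mpps: float, cores: int) -> float:
    """Average throughput contributed by each core, in MPPS."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return total_mpps / cores


def speedup(mpps: float, baseline_mpps: float) -> float:
    """Throughput relative to a baseline run in the same setup."""
    return mpps / baseline_mpps


if __name__ == "__main__":
    # Smallest core count measured in this setup serves as the baseline.
    baseline = L2_MPPS_BY_CORES[2]
    for cores, mpps in sorted(L2_MPPS_BY_CORES.items()):
        print(f"{cores:2d} cores: {mpps:5.1f} MPPS total, "
              f"{per_core_mpps(mpps, cores):5.2f} MPPS/core, "
              f"{speedup(mpps, baseline):4.2f}x vs 2 cores")
```

The per-core average drops sharply at 16 cores, which is consistent 
with a limit outside the CPU such as the NIC, though these three data 
points alone do not prove it.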

So this is where we are now. The next steps are to extend our coverage 
of P4 features, e.g. registers and packet re-circulation.

If I understand correctly, you improved the original P4-BM by adding the 
enhancements you mentioned. Your numbers are promising, all the more so 
because the original P4-BM was not optimized for performance (we only 
analyzed BMv1, and after that we decided to rewrite it fully). If you 
have any specific questions about our implementation, please do not 
hesitate to ask me.
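
To put the two sets of numbers in this thread on one scale, the sketch 
below converts between nanoseconds per packet (used in the quoted mail) 
and MPPS (used above). It relies only on the identity 
MPPS = 1000 / (ns per packet); the function names are illustrative. The 
measurements come from different code bases, hardware and table sizes, 
so the result is an order-of-magnitude comparison at best.

```python
def mpps_from_ns_per_pkt(ns_per_pkt: float) -> float:
    """Convert a per-packet time in ns to millions of packets per second."""
    # 1 s = 1e9 ns, so packets/s = 1e9 / ns and MPPS = 1e3 / ns.
    return 1000.0 / ns_per_pkt


def ns_per_pkt_from_mpps(mpps: float) -> float:
    """Convert millions of packets per second to a per-packet time in ns."""
    return 1000.0 / mpps


# Per-packet times from the quoted mail (ns/pkt).
REFERENCE_NS = {
    "L2 switch, original": 1400,
    "SIMPLE_ROUTER, original": 4900,
    "L2 switch, improved": 500,
    "SIMPLE_ROUTER, improved": 1400,
}

# Single-core DPDK figures from this mail (MPPS).
DPDK_MPPS = {"L2 example": 13.04, "L3 example": 10.10}

if __name__ == "__main__":
    for name, ns in REFERENCE_NS.items():
        print(f"{name}: {ns} ns/pkt = {mpps_from_ns_per_pkt(ns):.2f} MPPS")
    for name, mpps in DPDK_MPPS.items():
        print(f"{name}: {mpps} MPPS = {ns_per_pkt_from_mpps(mpps):.1f} ns/pkt")
```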

Thanks for your mail and I hope that we can somehow collaborate and help 
each other, making the European presence stronger in the P4 community.

Ciao,
Sandor

-- 
Sándor Laki, PhD
Assistant professor
Department of Information Systems
Eötvös Loránd University
Pázmány Péter stny. 1/C
H-1117, Budapest, Hungary
Room 2.506
Web: http://lakis.web.elte.hu
Phone: +36 1 372 2869 / 8477
Cell: +36 70 374 2646

On 2016.07.27. 15:57, Luigi Rizzo wrote:
> Hi,
> I have read the thread on the p4-dev list regarding your
> P4 compiler targeting DPDK, and was wondering if you have
> some performance data (in terms of packets per second or similar)
> of your code, say for some simple P4 configuration.
>
> With my student Yuri we have been working on accelerating the
> reference p4 code on github (also adding support for netmap) and while
> we have very good results in accelerating I/O and queues,
> the main bottleneck is now in the ingress and egress stages.
>
> Roughly speaking (we will post this later to the list):
>
> - the reference P4 code has two operating modes:
>
>    SINGLE
>       all stages (input, parse, ingress, egress, deparse, output)
>       run in the same thread.
>    MULTI
>       processing is split in multiple threads (e.g. four)
>       connected by queues
>
>    The reference code has however very expensive queues so the
>    "MULTI" case is actually slower than the "SINGLE" one.
>    We measured some 4900 ns/pkt for SIMPLE_ROUTER, and 1400 ns/pkt
>    for L2 switch
>
> - Yuri and I made a number of enhancements to the queues,
>    making them lock free, and that made the MULTI case more
>    efficient, so the bottleneck is now the slowest stage in
>    the pipeline.
>
> - We also worked on the memory allocator (another significant
>    bottleneck) and added support for netmap.
>
> Overall, we are now down to about 500 ns per packet for the L2
> switch, and 1400 ns/pkt for the SIMPLE_ROUTER. We still have
> some room for improvement in the latter case.
>
> cheers
> luigi
>
