[P4] any performance data on your P4ELTE/p4c compiler for DPDK ?
Sándor Laki
lakis at elte.hu
Wed Aug 3 11:15:20 CEST 2016
Hi,
A good starting point is the dpdk folder /src/hardware_dep/dpdk in the
github repository, which contains the DPDK-specific code snippets,
including primitive functions, table operations and the raw packet
definition. The raw packet representation comes from the DPDK HAL: the
type packet is an alias for the rte_mbuf used by DPDK. From a
packet_descriptor instance pd, this raw representation can be accessed
via pd->wrapper; it is used only in HAL functions.
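Roughly, the pieces fit together like the simplified sketch below. Only
packet, packet_descriptor and the wrapper field are the real names from
the description above; the struct layout and the accessor function are
just illustrative assumptions, not the actual generated headers.

    /* Minimal sketch of the DPDK packet aliasing described above.
     * Only "packet", "packet_descriptor" and "wrapper" come from the
     * description; everything else is an illustrative assumption. */
    #include <stdint.h>
    #include <rte_mbuf.h>

    typedef struct rte_mbuf packet;   /* "packet" is an alias for DPDK's rte_mbuf */

    typedef struct packet_descriptor_s {
        packet *wrapper;              /* raw DPDK representation, touched only by HAL code */
        /* parsed headers, metadata, ... (omitted) */
    } packet_descriptor;

    /* Example of a HAL-internal accessor for the raw packet bytes. */
    static inline uint8_t *hal_raw_data(packet_descriptor *pd)
    {
        return rte_pktmbuf_mtod(pd->wrapper, uint8_t *);
    }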
The hardware-dependent initialization, main loop definition, etc. also
belong to the HAL. We may revise this later, but this is how it works
currently.
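For orientation, a bare-bones DPDK receive loop looks roughly like the
sketch below. This is not what our compiler emits, just the usual
EAL-init plus rx-burst pattern that the HAL has to provide; port setup
is omitted and process_packet() is a placeholder for the generated
pipeline.

    /* Generic DPDK receive-loop sketch (not the generated code). */
    #include <stdlib.h>
    #include <rte_eal.h>
    #include <rte_ethdev.h>
    #include <rte_mbuf.h>

    #define BURST_SIZE 32

    int main(int argc, char **argv)
    {
        if (rte_eal_init(argc, argv) < 0)
            return EXIT_FAILURE;

        /* ... rte_eth_dev_configure(), queue setup, rte_eth_dev_start() ... */

        struct rte_mbuf *bufs[BURST_SIZE];
        for (;;) {
            uint16_t nb_rx = rte_eth_rx_burst(0 /* port */, 0 /* queue */,
                                              bufs, BURST_SIZE);
            for (uint16_t i = 0; i < nb_rx; i++) {
                /* process_packet(bufs[i]);  -- parse, apply tables, deparse */
                rte_pktmbuf_free(bufs[i]);
            }
        }
        return 0;
    }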
I hope this helps you to proceed. Our team is more or less available over
the summer (with somewhat longer response times), so please do not hesitate
to ask if you have questions.
Best,
Sandor
On 2016.08.02. 16:41, Luigi Rizzo wrote:
> Hi,
> thanks for the reply (and hey, it's summer, it is expected
> that people are away).
>
> Really impressive numbers you have!
>
> (note to self: repo is at https://github.com/P4ELTE/p4c )
>
> We would be tempted to try and write a netmap (and possibly libpcap) HAL
> for your project, to see how much code it would take, and what kind of
> performance we can get.
>
> Any ideas/suggestions on how to proceed?
> Also, apart from the device I/O, how much does your code depend on
> DPDK, e.g. for packet representation?
>
> cheers
> luigi
>
>
>
> On Tue, Aug 2, 2016 at 3:55 PM, Sándor Laki <lakis at elte.hu> wrote:
>> Hi Luigi,
>>
>> Sorry for the delay; I was out of office for a few days.
>>
>> First of all, I have to point out that our compiler only reuses the P4
>> front-end called HLIR; the back-end has been fully reimplemented. In our
>> case the idea was to support multiple architectures, resulting in a core
>> compiler that relies on a Hardware Abstraction Library (HAL) implemented
>> for each given target. The HAL for DPDK is released with our compiler
>> (see github), and the one for the Freescale NPU is under development.
>>
>> The first measurements with Intel NICs show 13.04 and 10.10 MPPS for the
>> L2 and L3 examples on a single-core setup with a few entries in the tables.
>> For L2, we also have some preliminary scalability measurements in another
>> setup with 200 entries in the smac and dmac tables: 17.5 MPPS with 2 cores,
>> 27.8 MPPS with 4 cores, and 33.4 MPPS with 16 cores (this seems to be the
>> hardware bottleneck of the Mellanox ConnectX-4; it was actually measured
>> this morning). More comprehensive experiments will be available soon.
>>
>> So this is where we are now. Next steps are to extend the coverage of P4
>> features, e.g. registers, packet re-circulation, etc.
>>
>> If I understand correctly, you improved the original P4-BM by adding the
>> enhancements you mentioned. Your numbers are promising, all the more so
>> because the original P4-BM was not optimized for performance (we only
>> analyzed BMv1 and after that we decided to fully rewrite it). If you have
>> any specific questions about our implementation, please do not hesitate
>> to ask me.
>>
>> Thanks for your mail and I hope that we can somehow collaborate and help
>> each other, making the European presence stronger in the P4 community.
>>
>> Ciao,
>> Sandor
>>
>> On 2016.07.27. 15:57, Luigi Rizzo wrote:
>>> Hi,
>>> I have read the thread on the p4-dev list regarding your
>>> P4 compiler targeting DPDK, and was wondering if you have
>>> some performance data (in terms of packets per second or similar)
>>> of your code, say for some simple P4 configuration.
>>>
>>> With my student Yuri we have been working on accelerating the
>>> reference p4 code on github (also adding support for netmap) and while
>>> we have very good results in accelerating I/O and queues,
>>> the main bottleneck is now in the ingress and egress stages.
>>>
>>> Roughly speaking (we will post this later to the list):
>>>
>>> - the reference P4 code has two operating modes:
>>>
>>> SINGLE
>>> all stages (input, parse, ingress, egress, deparse, output)
>>> run in the same thread.
>>> MULTI
>>> processing is split in multiple threads (e.g. four)
>>> connected by queues
>>>
>>> The reference code, however, has very expensive queues, so the
>>> "MULTI" case is actually slower than the "SINGLE" one.
>>> We measured roughly 4900 ns/pkt for SIMPLE_ROUTER, and 1400 ns/pkt
>>> for the L2 switch.
>>>
>>> - Yuri and I made a number of enhancements to the queues,
>>> making them lock free, and that made the MULTI case more
>>> efficient, so the bottleneck is now the slowest stage in
>>> the pipeline.
>>>
>>> - We also worked on the memory allocator (another significant
>>> bottleneck) and added support for netmap.
>>>
>>> Overall, we are now down to about 500 ns per packet for the L2
>>> switch, and 1400 ns/pkt for the SIMPLE_ROUTER. We still have
>>> some room for improvement in the latter case.
>>>
>>> cheers
>>> luigi
>>>
>>
>>
>
>
--
Sándor Laki, PhD
Assistant professor
Department of Information Systems
Eötvös Loránd University
Pázmány Péter stny. 1/C
H-1117, Budapest, Hungary
Room 2.506
Web: http://lakis.web.elte.hu
Phone: +36 1 372 2869 / 8477
Cell: +36 70 374 2646