Metrics

Prometheus endpoint configuration

diode-send and diode-receive implement a metrics system compatible with Prometheus. To enable it, it is required to provide the URL that Prometheus will scrape.

There are two URLs to set, one for the sender and one for the receiver. Both use the “metrics” option, in the [sender] and [receiver] sections respectively.

[sender]
metrics = "127.0.0.1:9001"

[receiver]
metrics = "127.0.0.1:9001"

Note

When running both on the same host for tests, they must use different ports; otherwise startup fails with the following error: Cannot init metrics: cannot start http listener: failed to create HTTP listener: Address already in use
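
To check by hand that an endpoint is up, the exposed counters can be fetched and parsed. The following is a minimal Python sketch, not part of lidi. It assumes the metrics are served over HTTP in the Prometheus text exposition format at the configured address; the /metrics path and the helper names parse_metrics and fetch_metrics are assumptions made for illustration.

```python
import urllib.request


def parse_metrics(text):
    """Parse Prometheus text exposition lines into a {name: value} dict.

    Labels, if present, are kept as part of the name. Comment lines
    (# HELP / # TYPE) and blank lines are skipped.
    """
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Assumption: no trailing timestamp, so the value is the last field.
        name, _, value = line.rpartition(" ")
        if not name:
            continue  # not a "name value" line
        try:
            metrics[name] = float(value)
        except ValueError:
            continue  # ignore lines whose last field is not a number
    return metrics


def fetch_metrics(address, path="/metrics", timeout=2.0):
    """Fetch and parse metrics from e.g. "127.0.0.1:9001" (path is assumed)."""
    url = f"http://{address}{path}"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_metrics(response.read().decode("utf-8"))
```

For a local test with two distinct ports, fetch_metrics("127.0.0.1:9001") would then query one side and a second call with the other port would query the other side.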

Relevant metrics in diode-send

diode-send

  • tx_sessions : total number of TCP connections accepted by diode-send

  • tx_tcp_blocks : total number of blocks received on TCP sessions

  • tx_tcp_bytes : total number of bytes received on TCP sessions

  • tx_encoding_blocks : total number of blocks successfully encoded

  • tx_encoding_blocks_err : total number of blocks lost due to encoding error

  • tx_udp_pkts : total number of UDP packets successfully sent to diode-receive

  • tx_udp_bytes : total number of bytes successfully sent in UDP packets to diode-receive. This counts only the UDP payload without the lidi header; it does not include the network transport headers of the packets (Eth/IP/UDP). Since it includes repair packets and one RaptorQ header per block, the value is larger than tx_tcp_bytes.

  • tx_udp_pkts_err : total number of UDP packets not sent (socket error)

  • tx_udp_bytes_err : total number of bytes not sent (socket error)
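
The relation between tx_udp_bytes and tx_tcp_bytes can be turned into an overhead figure. This is a minimal sketch, assuming both counters come from the same scrape; the function name udp_overhead_ratio is hypothetical.

```python
def udp_overhead_ratio(tx_tcp_bytes, tx_udp_bytes):
    """Return the extra UDP payload bytes as a fraction of the TCP input bytes.

    Returns None when no TCP bytes have been received yet, since the
    ratio is undefined in that case.
    """
    if tx_tcp_bytes <= 0:
        return None
    return (tx_udp_bytes - tx_tcp_bytes) / tx_tcp_bytes
```

A steadily positive ratio is expected, since repair packets and per-block headers are counted in tx_udp_bytes.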

Relevant metrics in diode-receive

All stats of diode-receive start with rx.

The processing pipeline is:

+---------------------+                 +---------------------------+                   +-----------------------+
| (udp sock) udp recv |   packets       | reorder + decoder         |   blocks          | tcp sender (tcp sock) |
|       rx_udp_*      | --------------> | rx_pop_*  + rx_reorder_*  | ----------------> |        rx_tcp_*       |
|                     |  reorder_queue  |   + rx_decoding_*         |  tcp_send_queue   |                       |
+---------------------+                 +---------------------------+                   +-----------------------+

UDP receive component

  • rx_udp_deserialize_header_err : total number of lost UDP packets due to corrupted header

  • rx_udp_recv_pkts_err : total number of socket read failures

  • rx_udp_send_reorder_err : total number of UDP packets lost because they could not be pushed to the reorder/decode queue. Try increasing the “udp_packets_queue_size” receiver config value, adjusting tc rate limiting, or optimizing RX performance (see receiver Multithreading).

Reorder and decoder component

Reorder queue :

  • rx_pop_reorder_queue_len : number of packets waiting in reorder queue (between udp receive thread and reorder/decoding thread)

  • rx_pop_udp_pkts : total number of UDP packets successfully received

  • rx_pop_udp_bytes : total number of bytes successfully received from UDP packets

  • rx_pop_ok_packets : total number of packets sent to the reordering module that completed a block: the reordering module used the packet to complete a block and returned it. This value should be less than or equal to rx_decoding_blocks (less, because a block can sometimes be decoded successfully even when not all of its packets were received; see rx_pop_timeout_with_packets).

  • rx_pop_ok_none : total number of packets sent to reordering module, without finishing a block. Reordering module kept this packet and returned nothing, waiting for other packets to finish a block

  • rx_pop_timeout_with_packets : a timeout occurred before the current block received all the packets needed to complete it. Decoding is still attempted and may succeed if enough data was received.

  • rx_pop_timeout_none : a timeout occurred while no packet was waiting for the current block.
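
The relation between rx_pop_ok_packets and rx_decoding_blocks stated above can be checked from a scrape. This is only a sketch with a hypothetical function name. It assumes both counters were read at the same instant; counters read at slightly different times may briefly appear to break the relation.

```python
def blocks_decoded_without_all_packets(rx_pop_ok_packets, rx_decoding_blocks):
    """Estimate how many blocks were decoded without a completing packet.

    Raises ValueError when the documented relation
    rx_pop_ok_packets <= rx_decoding_blocks does not hold.
    """
    if rx_pop_ok_packets > rx_decoding_blocks:
        raise ValueError(
            "rx_pop_ok_packets exceeds rx_decoding_blocks: "
            f"{rx_pop_ok_packets} > {rx_decoding_blocks}"
        )
    return rx_decoding_blocks - rx_pop_ok_packets
```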

Reorder component :

  • rx_reorder_flush_block_complete : next block is returned because we received all packets for this block

  • rx_reorder_flush_block_overflow : next block is returned because there are too many active blocks (50). This happens when there are missing packets in a session with many blocks.

  • rx_reorder_flush_block_expired : next block is returned because we have not received any packet for this block for at least block_expiration_timeout milliseconds. This happens when there are missing packets for the last blocks of a session.

  • rx_reorder_flush_session_expired : current active session is closed because we have not received any packet for this session for at least session_expiration_timeout milliseconds.

  • rx_reorder_flush_nothing : no block is returned : there is nothing to decode now but there are waiting packets for the current session

  • rx_reorder_flush_nothing_inactive : no block is returned : there is nothing to decode and there is no waiting packet for the current session

Decoder and send :

  • rx_decoding_pkts_missing : total number of missing UDP packets when trying to decode blocks (packet drops, header error or queue full…).

  • rx_decoding_blocks : total number of blocks successfully decoded

  • rx_decoding_blocks_err : total number of blocks lost due to decoding error: too many packets missing or corrupted at the time of decoding.

  • rx_decoding_send_block_err : total number of blocks lost because they could not be pushed to the TCP sender queue (most probably because it is full). Try increasing the “tcp_blocks_queue_size” receiver config value or adjusting sender/receiver TCP throughput.

TCP sender component

  • rx_tcp_send_queue_len : number of blocks waiting in send_queue (between reorder/decoding thread and tcp sender thread)

  • rx_tcp_drop_block : total number of blocks received and dropped by the TCP send component because the TCP session is not established (the init block for this session was missing).

  • rx_tcp_no_block : total number of messages received and dropped by the TCP send component due to a decoding issue. Should be close to rx_decoding_blocks_err.

  • rx_tcp_blocks : total number of blocks sent on TCP session

  • rx_tcp_blocks_err : total number of lost blocks, not sent on TCP session (socket error)

  • rx_tcp_bytes : total number of bytes sent on TCP session

  • rx_tcp_bytes_err : total number of lost bytes, not sent on TCP session (socket error)

  • rx_tcp_sessions : total number of completed TCP sessions (last block received)

Kernel statistics

  • snmp_ip_in_discards : From kernel: the number of input IP datagrams for which no problems were encountered to prevent their continued processing, but which were discarded (e.g., for lack of buffer space). See RFC 1213.

  • snmp_udp_in_errors : From kernel: the number of received UDP datagrams that could not be delivered for reasons other than the lack of an application at the destination port. See RFC 1213.

  • cpu_usage : Same as top. Label is thread name.
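
For cross-checking the two snmp_* values against the kernel directly, the sketch below parses the Linux /proc/net/snmp file. That these metrics correspond to the Ip InDiscards and Udp InErrors fields of that file is an assumption based on their names, and both function names are hypothetical.

```python
def parse_proc_net_snmp(text):
    """Parse /proc/net/snmp content into {protocol: {field: int}}."""
    tables = {}
    lines = [line.split() for line in text.splitlines() if line.strip()]
    # The file alternates one header line and one value line per protocol.
    for header, values in zip(lines[0::2], lines[1::2]):
        if header[0] != values[0] or len(header) != len(values):
            continue  # skip malformed header/value pairs
        proto = header[0].rstrip(":")
        tables[proto] = {k: int(v) for k, v in zip(header[1:], values[1:])}
    return tables


def read_kernel_drop_counters(path="/proc/net/snmp"):
    """Read the two kernel counters assumed to back the snmp_* metrics."""
    with open(path, "r", encoding="ascii") as f:
        tables = parse_proc_net_snmp(f.read())
    return {
        "snmp_ip_in_discards": tables["Ip"]["InDiscards"],
        "snmp_udp_in_errors": tables["Udp"]["InErrors"],
    }
```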

Summary of data loss metrics (diode-receive side)

Packet loss metrics

If too many packets are lost, block decoding errors will occur.

  • rx_udp_deserialize_header_err

  • rx_udp_send_reorder_err

  • rx_decoding_pkts_missing

  • rx_udp_recv_pkts_err (possibly; the error cases are not clearly identified)

Block loss metrics

If a block is lost, the whole session is lost.

  • rx_decoding_blocks_err

  • rx_decoding_send_block_err

  • rx_tcp_blocks_err
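
The two lists above can be watched together by comparing successive scrapes. This is a minimal sketch, assuming each scrape is available as a dict mapping metric name to value. It uses the metric names as defined in the component sections, and the function name loss_delta is hypothetical.

```python
PACKET_LOSS_METRICS = (
    "rx_udp_deserialize_header_err",
    "rx_udp_send_reorder_err",
    "rx_decoding_pkts_missing",
    "rx_udp_recv_pkts_err",
)
BLOCK_LOSS_METRICS = (
    "rx_decoding_blocks_err",
    "rx_decoding_send_block_err",
    "rx_tcp_blocks_err",
)


def loss_delta(previous, current):
    """Return the loss counters that increased between two scrapes.

    The result has two keys, "packets" and "blocks", each mapping a
    metric name to its increase. Counters missing from a scrape are
    treated as 0.
    """
    report = {"packets": {}, "blocks": {}}
    groups = (("packets", PACKET_LOSS_METRICS), ("blocks", BLOCK_LOSS_METRICS))
    for kind, names in groups:
        for name in names:
            delta = current.get(name, 0) - previous.get(name, 0)
            if delta > 0:
                report[kind][name] = delta
    return report
```

A non-empty "blocks" entry is the more serious signal, since a lost block means the whole session is lost.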