AirVantage asset data commands are sometimes very slow to reach target


#1

Hello all,

I’ve recently been working on issuing commands to our hardware using AirVantage asset data commands and the AirVantage REST API to issue these commands. For the most part, it works well and responds quickly. However, roughly 1 in 10 requests seem to take up to 5 minutes to arrive on the hardware. Is this normal/expected behaviour? If so, where is the bottleneck?

Any help is greatly appreciated. Let me know if you any logs or debug info would be helpful and I can provide it.

Cheers!


#2

Example operation that took ~2 minutes


#3

Hey all,

I’m still seeing this, but it seems worse lately. It’s not uncommon for them to just never get delivered period now.

I’ve been sure to check that my target and operations are using the same path:

May 29 21:00:01 swi-mdm9x15 user.info Legato:  INFO | avcDaemon[1019]/avcDaemon T=main | avData.c le_avdata_AddResourceEventHandler() 2460 | Registering handler on /syncConfig/commands/takePhoto


#4

@nick

I have had similar issues with commands taking a long time to execute. What version of Legato are you currently running? I’ve seen much better AirVantage connection performance (upstream & downstream) using 18.04 with the mangOH legato patches applied.


#5

Hey @drewwestrick,

We’re running the same setup (18.04 and patches) and it still performs poorly. It seems today that none of our requests are getting delivered period.


#6

@nick

Yeah, I’ve definitely seen things be hit or miss with AV especially on the data receive side of things.

I’m assuming you can do a cm data connect and ping 8.8.8.8 with no problems?

Is it just on the data receive side of things or are you also seeing any delays / issues on pushing messages?


#7

@rjacolin can you look into this?


#8

@dclark75 can you look into this as well?

@nick can you share the log files?


#9

Hey @drewwestrick,

Units are connecting fine as we can see them pushing time series data to AirVantage.

Overall I really like the design and API of AirVantage, but reliability from the target has been marginal at best. Pushing data (primarily with time series) works well for the most part, but we have seen publishes begin to fail after being deployed in the field for a while. We have a watchdog in place to catch this (reboot fixes it), but it’s annoying and suboptimal that the device is required to reboot regularly.

Sending data to the target has been much more unpredictable. A few days ago I was receiving commands almost instantly, then delayed by 2 to 10 minutes, and then not at all. I’m not really sure how to explain this variance. We’ve been sure to keep Legato up to date along with those patches in the hopes that this improves, but it’s still proving pretty problematic.

I have been strongly considering prototyping a connection service implemented over MQTT (@dfrey already did some work on an MQTT client). AirVantage provides excellent abstraction that is powerful to developers, but abstraction is frustrating if it performs unpredictably.


#10

Which logs are of primary interest? Things look normal from the target when this problem occurs (connected and pushing data).


#11

To all,

A problem here may be related with the patch 18.04.0/wakaama/0001-Fix-processing-of-duplicate-message-IDs-p2.patch. Try removing this patch from the Legato 18.04 and rebuild. Before I went away on vacation the Legato team showed me there is a problem with this patch. FYI though that removing this patch will add back in a bug where multiple messages may make it to the device due to UDP communication not filtering duplicate messages. This work is in progress for 18.05


#12

Hi @nick,

Do pending commands get delivered to your target when:

  • you start a new connection to AirVantage?
  • you send data to AirVantage?

When there is no data exchange for a while (20s or more), the network might release the NAT. In this event, AirVantage can no longer send commands to your device (as external IP address of the device is no longer valid). IP will be refreshed on the next upstream data (new connection or data posting).

Regards
Nhon


#13

Hey @NhonChu, it really does not seem to make a difference. I even tried sending a command while a unit was off, and then watched the logs while it connected to AirVantage and it never saw the command.


#14

Hey @dclark75,

I reverted that patch. Did not really seem to make much of a difference.


#15

Hi @nick,

Could you enable the log with debug option turned on (log level debug).
Then try sending a command while the device is not connected, and then capture logs while it connect to AirVantage

Thanks


#16

Hey @NhonChu,

Here are the logs for avcService after stopping it, setting log level debug, sending a command, and then starting avcService again.

I did have a hard time grabbing these logs as my WP85 did not want to connect. I kept getting stuck with this:

root@swi-mdm9x15:~# cm radio
Power:                 OFF
Current Network Operato
Current RAT:           Not available
Status:                Not registered and not currently searching for new operator (LE_MRC_REG_NONE)
Signal:                No signal strength (0)
PS:                    Packet Switched Unknown state (LE_MRC_REG_UNKNOWN)

I tried invoking cm radio off, cm radio on, and cm data connect to get it online, but it didn’t seem to work. I got it working again after a few reboots. I’m starting to think their may be an issue in the modem service somewhere. I’ve been considering stripping out the wifiService dependency since we are not using wifi right now.

Cheers!


#17

Hi @nick,
Thank you for capturing the log.

Is “/syncConfig/commands/takePhoto” the command you sent from AirVantage ?
According to the log, your device receives this command twice:

Jun 3 06:01:15 swi-mdm9x15 user.info Legato: INFO | avcDaemon[1317]/avcDaemon T=main | avData.c AvServerRequestHandler() 2393 | >>>>> Request Uri is: [/syncConfig/commands/takePhoto]
Jun 3 06:01:15 swi-mdm9x15 user.debug Legato: DBUG | avcDaemon[1317]/avcDaemon T=main | avData.c ProcessAvServerExecRequest() 2304 | >>>>> COAP_POST - Server executes a command on device

The second one is at 06:01:16.

Does the command remain in “in progress” on the portal ?

Thanks


#18

Hey @NhonChu,

In this particular instance it did receive the command and AirVantage showed one as failed and one as completed. It definitely works sometimes, but to date, performance has been too unreliable for this to be something we can ship to customers.

This seems to really come down to the connection state of the target. If the target is freshly powered on, and freshly connected (with good cell signal), things seem to work OK. Units that have been deployed in the field, or units that have a lousy connection (~2 bars) seem to receive commands very late or not at all.


#19

Hi Nick,

Unless we get a log that points to an issue on the device, it is hard to answer. I would try though.

As you know the response to a command is sent asynchronously from the app. From the logs it is clear that the device received 2 commands and responded to both of them.

Jun 3 06:01:15 swi-mdm9x15 user.debug Legato: DBUG | avcDaemon[1317]/avcDaemon T=main | osDebug.c lwm2m_printf() 84 | [prv_async_response:832] total payload length = 0
Jun 3 06:01:16 swi-mdm9x15 user.debug Legato: DBUG | avcDaemon[1317]/avcDaemon T=main | osDebug.c lwm2m_printf() 84 | [prv_async_response:832] total payload length = 0

But I don’t know the response code sent from the app. You indicated that one operation failed and the other was successful. This could be due to the error code sent from the app using le_avdata_ReplyExecResult(). Also, it is important to note that if the app fails to respond, the next job on the server could be stuck waiting for the current operation to complete. Can you please track the response codes and timeline of app’s responses?

The original issue you reported is centered around performance and commands taking a long time to arrive. I don’t see a bottleneck in your use case. However, CoAP retransmission is slow. If you have a lossy network, it is possible that the NAT is released before all retransmission attempt is complete. At this point, the device can receive command only at next successful connection. How frequently do you initiate a connection to the server?

Time series or data push will work even when NAT is released as the device performs DTLS resume to refresh the connection.

Can you clarify on the following items to verify if the issue is related to poor network connection?

  1. What is the signal strength when you see this issue?
  2. The network operator and NAT_TIMEOUT (depends on operator)?
  3. What is the ping time (to get a sense of latency of the network). On my device, I have the following compared to 15 ms on my Desktop.

root@swi-mdm9x28:~# ping bs.airvantage.net
PING bs.airvantage.net (52.24.215.63): 56 data bytes
64 bytes from 52.24.215.63: seq=0 ttl=233 time=49.216 ms
64 bytes from 52.24.215.63: seq=1 ttl=233 time=48.255 ms

Regards,
Prush


#20

Hey @Prush,

The logs provided show the system functioning as intended. The failed response in that instance is expected (our app will decline a request if its already busy).

Our devices initiate a connection and AirVantage session on power up and thats it. The intent is for the device to stay online and connected at all times.

  1. We’ve seen it happen at all signal strength levels, but it seems to happen more frequently when the signal goes below 2 bars.
  2. We are using Sierra SIM cards, and they usually seem to connect to Rogers around here. How do I find out about NAT_TIMEOUT?
root@swi-mdm9x15:~# ping bs.airvantage.net
PING bs.airvantage.net (52.24.215.63): 56 data bytes
64 bytes from 52.24.215.63: seq=0 ttl=234 time=616.573 ms
64 bytes from 52.24.215.63: seq=1 ttl=234 time=575.522 ms
64 bytes from 52.24.215.63: seq=2 ttl=234 time=534.046 ms
64 bytes from 52.24.215.63: seq=3 ttl=234 time=494.674 ms
64 bytes from 52.24.215.63: seq=4 ttl=234 time=994.170 ms
64 bytes from 52.24.215.63: seq=5 ttl=234 time=713.535 ms
64 bytes from 52.24.215.63: seq=6 ttl=234 time=673.096 ms
64 bytes from 52.24.215.63: seq=7 ttl=234 time=511.857 ms

Thanks for the help @Prush. Let me know if you need any additional information. We are very motivated to resolve these issues.