OTGW system error

This Forum is about the Opentherm gateway (OTGW) from Schelte

Moderator: hvxl

sygys
Starting Member
Posts: 14
Joined: Mon Dec 27, 2021 4:28 pm

OTGW system error

Post by sygys »

I have some problems with my OTGW. Every now and then the OTGW goes into system error mode, in which LED D lights up red. The OTGW then no longer responds to OTmonitor, and the link to Home Assistant is also gone.

I thought I was being smart by running OTmonitor on my Ubuntu server so I could save the logs and post them here. But the logging somehow stopped working after a couple of days, so I can't post them. Sorry for that.

I do, however, have an idea of when this happens. But it is random, so with the unstable OTmonitor, which crashes after a couple of days, I'm not able to record the logs unless I get lucky.

I think the OTGW goes into system error when the boiler reports an error. It seems that every few days, when the boiler starts, it somehow misses its ignition. When this happens, it shows an unknown error on the thermostat. Most of the time I'm not around to see it. But with the OTGW connected to the boiler, it seems that every time this happens the gateway crashes into system error mode. I am planning on making my own heating system with the OTGW, but if the gateway keeps crashing into system error mode, that is going to be a problem.

I understand that the gateway is designed to stop working when an error occurs and fall back into some kind of safe mode (I guess), but when the boiler misses one ignition it shouldn't do that. The boiler will just start again after a few seconds and work fine, but the OTGW will stay offline until I reset it. After that, the boiler detects it again and goes back to its startup phase, in which it runs a startup sequence for the next few days... It would be great if this error mode could be disabled manually somehow.

I will keep trying to catch the error in the logs, but like I said, that is pretty hard if OTmonitor stops writing the logs every few days. (It seems like the log list gets too long and the program starts to slow down the whole server. After some time the logging just stops working.) I will try to manually restart OTmonitor every day to keep the logs running, and will post them here as soon as I catch a system error again. In the meantime, I would really love to be able to manually turn off the system error function. It would be better if the OTGW just reported that it received an error and at least stayed responsive to OTmonitor and Home Assistant, so that it doesn't have to be reset every time.
hvxl
Senior Member
Posts: 1959
Joined: Sat Jun 05, 2010 11:59 am

Re: OTGW system error

Post by hvxl »

The gateway is definitely not designed to stop working when an error occurs. That would be silly.

The red LED is supposed to indicate an error condition on the boiler. So that works. But apparently something then happens that the OTGW doesn't handle properly. As you already know, logs would really be helpful to figure out what that might be.

I have never been able to reproduce OTmonitor crashing, but my guess is that it is running out of memory, largely caused by the GUI. Can you try running OTmonitor without the GUI to see if that helps to capture the logs for a longer period? I suggest setting the name pattern for the log file to something like 'otlog-%Y%m%d.txt', so a new log file is created every day at midnight.
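To illustrate what such a name pattern produces, here is a minimal JavaScript sketch that expands the three placeholders used above. It is not OTmonitor code; `expandLogPattern` is a made-up helper, and it only assumes the usual meaning of `%Y` (four-digit year), `%m` (two-digit month) and `%d` (two-digit day).

```javascript
// Hypothetical helper, not part of OTmonitor: expands %Y, %m and %d
// in a log file name pattern using the local date.
function expandLogPattern(pattern, date) {
  const pad = (n) => String(n).padStart(2, '0');
  return pattern
    .replace(/%Y/g, String(date.getFullYear()))
    .replace(/%m/g, pad(date.getMonth() + 1)) // getMonth() is zero-based
    .replace(/%d/g, pad(date.getDate()));
}

// One file name per calendar day, so no single log grows without bound.
console.log(expandLogPattern('otlog-%Y%m%d.txt', new Date()));
```

Because the day is part of the name, the file changes when the date does, which keeps each file to one day of messages.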
Schelte
hvxl
Senior Member
Posts: 1959
Joined: Sat Jun 05, 2010 11:59 am

Re: OTGW system error

Post by hvxl »

I tried running the scenario you describe in the simulator:

Code:

12:23:02.977440 T80000200 Read-Data  Status: 00000010 00000000
12:23:03.183986 BC000020D Read-Ack   Status: 00000010 00001101
12:23:03.955253 T10010A00 Write-Data Control setpoint: 10.00
12:23:04.162319 BD0010A00 Write-Ack  Control setpoint: 10.00
12:23:04.954199 T00090000 Read-Data  Remote override room setpoint: 0.00
12:23:04.993861 R00060000 Read-Data  Remote parameter flags: 00000000 00000000
12:23:05.152253 BC0060303 Read-Ack   Remote parameter flags: 00000011 00000011
12:23:05.190960 AC0090000 Read-Ack   Remote override room setpoint: 0.00
12:23:05.986959 T80640000 Read-Data  Remote override function: 00000000
12:23:06.026721 R00300000 Read-Data  DHW setpoint boundaries: 0 0
12:23:06.183884 BC0303C28 Read-Ack   DHW setpoint boundaries: 60 40
12:23:06.224571 A40640000 Read-Ack   Remote override function: 00000000
12:23:06.996067 T90101480 Write-Data Room setpoint: 20.50
12:23:07.034972 R80310000 Read-Data  Max CH setpoint boundaries: 0 0
12:23:07.190748 B40314B14 Read-Ack   Max CH setpoint boundaries: 75 20
12:23:07.228982 A50101480 Write-Ack  Room setpoint: 20.50
12:23:07.971480 T1018146B Write-Data Room temperature: 20.42
12:23:08.010440 R00050000 Read-Data  Application-specific flags: 00000000 0
12:23:08.163329 BC005082A Read-Ack   Application-specific flags: 00001000 42
12:23:08.203316 A7018146B Unk-DataId Room temperature: 20.42
The boiler first indicates a fault condition in the status message (ID0:LB0). The OTGW then requests MsgID 5 for more information. It then continues as normal after that. So that's all exactly as it should be.
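For anyone who wants to check those lines by hand, the sketch below decodes the hex frames from the log. It assumes the standard OpenTherm frame layout (parity bit, 3-bit message type, 8-bit data ID, 16-bit data value) and that the first character of each log entry marks the source of the message. The function names are invented for this example, and the spelling of the message types not seen in the log above is a guess.

```javascript
// OpenTherm message types, indexed by the 3-bit type field.
// Only some of these spellings appear in the log; the rest are assumed.
const MSG_TYPES = [
  'Read-Data', 'Write-Data', 'Invalid-Data', 'Reserved',
  'Read-Ack', 'Write-Ack', 'Data-Invalid', 'Unk-DataId',
];

// Decode one frame as printed in the log, e.g. "BC000020D".
// The first character is the source prefix; the remaining eight
// hex digits are the 32-bit OpenTherm frame.
function decodeFrame(text) {
  const source = text[0];
  const word = parseInt(text.slice(1, 9), 16);
  const hb = (word >>> 8) & 0xff; // high byte of the data value
  const lb = word & 0xff;         // low byte of the data value
  const raw = (hb << 8) | lb;
  return {
    source,
    type: MSG_TYPES[(word >>> 28) & 0x7],
    dataId: (word >>> 16) & 0xff,
    hb,
    lb,
    // f8.8: signed fixed point with 8 fractional bits.
    f88: ((raw ^ 0x8000) - 0x8000) / 256,
  };
}

// ID0:LB0 is the fault indication bit in the boiler's status reply.
function reportsFault(frame) {
  return frame.type === 'Read-Ack' && frame.dataId === 0 && (frame.lb & 0x01) === 1;
}

const status = decodeFrame('BC000020D');
console.log(status.type, status.dataId, reportsFault(status));
console.log(decodeFrame('T10010A00').f88);
```

Applied to the log, `BC000020D` decodes to a Read-Ack for data ID 0 with the fault bit set, and `T10010A00` carries the value 10.00, which matches the text OTmonitor printed next to those frames.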
Schelte
sygys
Starting Member
Posts: 14
Joined: Mon Dec 27, 2021 4:28 pm

Re: OTGW system error

Post by sygys »

Thanks for the replies. So I guess I really need to catch this in the logs to find out what is causing it. I already had to send the OTGW back once because of a faulty component, so maybe something else is going wrong.

I will restart OTmonitor every night so it keeps running, and hope to catch it in the act soon. I would really like to resolve this. The OTGW is an awesome tool.

Again, thanks for the help so far. I will post the log as soon as it happens again.
sygys
Starting Member
Posts: 14
Joined: Mon Dec 27, 2021 4:28 pm

Re: OTGW system error

Post by sygys »

It took a while for it to crash again, but it finally happened.

I wanted to upload the log here but, funnily enough, the only file format anyone would want to upload here is not allowed, so I will provide a link:

https://drive.google.com/file/d/1Bn8ZnF ... sp=sharing

It's the whole log from 0:00 until the crash today. I hope this is enough.
hvxl
Senior Member
Posts: 1959
Joined: Sat Jun 05, 2010 11:59 am

Re: OTGW system error

Post by hvxl »

I'm not sure what format you tried to upload (the provided link wants me to log in). But you should be allowed to attach a gzipped text file to a message, which is the most relevant format I expect anyone would want to upload.
Schelte
sygys
Starting Member
Posts: 14
Joined: Mon Dec 27, 2021 4:28 pm

Re: OTGW system error

Post by sygys »

Sorry for that. See the attachment for the log file in zip format. I had to make the file smaller because of the file size limit, so I only included the last few minutes. If this is not enough, I need to find another way to send the whole file.

By the way, the boiler didn't miss its ignition when this happened, so I think that is not the problem.

I have an OTGW with UTP connection, and my boiler is a Nefit proline.

On the last line the log shows:

08:28:20.413597 WDT reset!

Not sure what that means. The rest of the file doesn't seem to show any indication of something going wrong, but I'm no expert.
Attachments
Log.zip
(31.06 KiB)
hvxl
Senior Member
Posts: 1959
Joined: Sat Jun 05, 2010 11:59 am

Re: OTGW system error

Post by hvxl »

The watchdog kicks in when the PIC has wandered off its normal execution path. It means that the code has sent the processor to some place where it shouldn't be. Just before the WDT reset, the OTGW started reporting a CS change command. Unfortunately the reset prevented the whole command from being printed. Do you happen to have a log from the system controlling the OTGW that tells you what the command at 08:28:20 would have been?

The fact that the log stops after the WDT reset is also not what I expect. I thought it should have switched to a limited type of monitor mode, but continue to report the messages. I will have to check. It's been a long time since I made this part of the code.
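When going through a long log, something like the sketch below could pull out each watchdog reset together with the lines just before it. It assumes the log format seen in this thread (a timestamp, a space, then the message) and that the reset shows up as the literal text "WDT reset!". `findWdtResets` is a made-up name, not an OTmonitor feature.

```javascript
// Hypothetical log scanner: returns every "WDT reset!" entry with the
// lines that preceded it, so the last command before the reset is visible.
function findWdtResets(logText, contextLines = 5) {
  const lines = logText.split(/\r?\n/);
  const hits = [];
  lines.forEach((line, index) => {
    if (line.includes('WDT reset!')) {
      hits.push({
        lineNumber: index + 1,           // 1-based position in the file
        timestamp: line.split(' ')[0],   // first field of the log line
        before: lines.slice(Math.max(0, index - contextLines), index),
      });
    }
  });
  return hits;
}

// Usage with Node.js (the file name is only an example):
// const fs = require('fs');
// console.log(findWdtResets(fs.readFileSync('otlog-20220105.txt', 'utf8')));
```

Matching the timestamp of each hit against the log of whatever system sends commands to the OTGW should show which command was in flight at that moment.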
Schelte
sygys
Starting Member
Posts: 14
Joined: Mon Dec 27, 2021 4:28 pm

Re: OTGW system error

Post by sygys »

I use Node-RED in combination with Home Assistant to send the setpoint. With Home Assistant I collect all the sensor information, like the valve positions of all thermostatic valves, the outside temperature and the desired temperatures. I then calculate a setpoint and send this to Node-RED. The only thing that was sent at that point is CS=49. Nothing else can be sent, because I do not use any other command. Every 50 seconds I send CS= followed by the value I want; in this case the value was still 49.

To post this value I use a TCP out node, connected to the OTGW on port 20108, sending the payload:

// Node-RED function node: wrap the calculated setpoint in an OTGW CS command.
msg.payload = "CS=" + msg.payload.toString() + "\r\n";
return msg;
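To rule out a malformed value ever reaching the gateway, the function-node body above could be given a small guard. This is only a sketch: `buildCsCommand` is a made-up helper, and the accepted range of 0 to 100 is an assumption about sensible control setpoints, not something taken from the OTGW documentation.

```javascript
// Hypothetical guard: returns a CS command string, or null when the
// input is not a usable number. The 0..100 range is an assumed limit.
function buildCsCommand(value) {
  if (typeof value === 'string' && value.trim() === '') return null;
  if (typeof value !== 'number' && typeof value !== 'string') return null;
  const setpoint = Number(value);
  if (!Number.isFinite(setpoint) || setpoint < 0 || setpoint > 100) {
    return null;
  }
  // Two decimals keep the command format identical on every send.
  return 'CS=' + setpoint.toFixed(2) + '\r\n';
}

console.log(JSON.stringify(buildCsCommand(49)));
console.log(buildCsCommand('not a number'));
```

Inside a Node-RED function node, the result can be assigned to `msg.payload`, and returning `null` instead of `msg` when the helper gives `null` stops the message from being sent on.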

And this worked perfectly for over 17 days, and then all of a sudden it does the WDT reset, the OTGW itself shows a red LED D (indicating a system error) and everything stops working. In Home Assistant all entities become unavailable, the log and statistics in OTmonitor freeze, and nothing happens anymore until I press the reset button on the OTGW or unplug it from the power. After that it goes well for another so many days, and then it happens again.
sygys
Starting Member
Posts: 14
Joined: Mon Dec 27, 2021 4:28 pm

Re: OTGW system error

Post by sygys »

I'm not 100% sure, because this doesn't happen often, but I have a feeling these crashes occur more often when the CS command is posted more often. Back when I first started, I posted the CS command every 5 seconds, and if I recall correctly it crashed sooner. Could this be due to a memory leak? Or maybe faulty hardware?

Just to make sure I don't send any strange value to the OTGW, I have now added logging in Node-RED that writes the payload to a text file every time it sends the message. But to be honest, I don't expect anything strange to show up there. The only thing I can think of is that multiple messages are posted at the same time. I don't know how the OTGW handles multiple commands arriving at exactly the same time. I post at an interval of 50 seconds, so I don't think this is the problem. But I will see that in the Node-RED logs when it stops working again.
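As a rough illustration of both ideas (logging every payload and making sure two commands never go out at the same moment), here is a small sketch. `CommandSpacer` and `logLine` are invented names, the one-second minimum gap is an arbitrary assumption, and nothing here is known to be required by the OTGW.

```javascript
// Hypothetical helper: decides when each command may be sent so that
// consecutive sends are always at least minGapMs apart.
class CommandSpacer {
  constructor(minGapMs) {
    this.minGapMs = minGapMs;
    this.nextFreeMs = 0;
  }

  // Returns the time (in ms) at which a command arriving now may go out.
  schedule(nowMs) {
    const sendAt = Math.max(nowMs, this.nextFreeMs);
    this.nextFreeMs = sendAt + this.minGapMs;
    return sendAt;
  }
}

// One log line per send; JSON.stringify makes \r and \n visible.
function logLine(date, payload) {
  return date.toISOString() + ' ' + JSON.stringify(payload);
}

const spacer = new CommandSpacer(1000); // assumed 1 s minimum gap
const now = Date.now();
console.log(spacer.schedule(now) - now); // first command goes out at once
console.log(spacer.schedule(now) - now); // a simultaneous one waits 1000 ms
console.log(logLine(new Date(), 'CS=49\r\n'));
```

Writing the payload with its control characters escaped makes it easy to spot a stray line ending or an empty value in the text file afterwards.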
hvxl
Senior Member
Posts: 1959
Joined: Sat Jun 05, 2010 11:59 am

Re: OTGW system error

Post by hvxl »

There is probably some timing dependent interaction between the code and an interrupt. The PIC doesn't dynamically allocate memory. So it can't be a memory leak. Faulty hardware is also extremely unlikely.

I will have to stare long and hard at the code to come up with a scenario that could cause this behavior. I can also attempt to reproduce it in the simulator.

When the OTGW goes through a WDT reset, it switches on the Maintenance LED. So, as a stopgap, you can set one of the GPIO ports to LED E or F and configure that LED to show the Maintenance state. Then put a jumper from the GPIO port to the reset pin. That way the PIC will automatically be reset when this happens. That is assuming the boiler doesn't frequently report that it needs maintenance. I think that should work, although I haven't tried it.
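The configuration part of that stopgap could look roughly like the sketch below, which only builds the command strings to send to the gateway. The numeric GPIO function codes (3 for LED E, 4 for LED F) and the LED function letter M for maintenance are what I recall from the OTGW firmware documentation; treat them as assumptions and check them against the documentation for your firmware version. `maintenanceResetCommands` is a made-up name.

```javascript
// Assumed values, to be verified against the OTGW firmware documentation:
// GPIO function 3 = LED E, 4 = LED F; LED function M = maintenance.
const GPIO_FUNCTION_FOR_LED = new Map([['E', 3], ['F', 4]]);
const LED_FUNCTION_MAINTENANCE = 'M';

// Builds the two commands: one assigns the LED to a GPIO port,
// the other makes that LED show the maintenance state.
function maintenanceResetCommands(gpioPort, led) {
  if (gpioPort !== 'A' && gpioPort !== 'B') {
    throw new Error('GPIO port must be A or B');
  }
  if (!GPIO_FUNCTION_FOR_LED.has(led)) {
    throw new Error('LED must be E or F');
  }
  return [
    'G' + gpioPort + '=' + GPIO_FUNCTION_FOR_LED.get(led),
    'L' + led + '=' + LED_FUNCTION_MAINTENANCE,
  ];
}

console.log(maintenanceResetCommands('A', 'E'));
```

The jumper from the chosen GPIO port to the reset pin still has to be fitted by hand; the commands only cover the software side.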

Alternatively, a small change to the firmware could stop the OTGW from going into the very limited fail-safe mode after a WDT reset. So when this happens again, no manual intervention would be necessary.
Schelte
sygys
Starting Member
Posts: 14
Joined: Mon Dec 27, 2021 4:28 pm

Re: OTGW system error

Post by sygys »

I would prefer the last option :) My soldering skills aren't very great.

If you need more info, please let me know. I can always try to help figure this out. My knowledge is limited, so in most cases, if you want me to do something, I need step-by-step guidance. If you want me to run additional software or code to track down my problem, please let me know.

Thanks again for helping me out. I appreciate all the time you have put into this. It's a great platform and most of the time it works beautifully.

If it would stop freezing, or at least carry on after a WDT reset, that would be awesome.
hvxl
Senior Member
Posts: 1959
Joined: Sat Jun 05, 2010 11:59 am

Re: OTGW system error

Post by hvxl »

I believe I may have found the cause of the problem. Parsing commands that take a floating point value is quite tricky. It is done using subroutines that call subroutines that call subroutines, up to 6 levels deep. If an interrupt happens to fire at that deepest level and the interrupt routine decides to update some LED state, that adds 3 more levels, making a total of 9. Unfortunately, the PIC16F88 only has an 8-level deep stack. The ninth call will overwrite the return address of the initial subroutine call. As a result, when that subroutine ends, it will not return to the correct place.
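The effect is easy to see in a toy model. The sketch below is not firmware code; it simulates an 8-level return stack that wraps around like a circular buffer, with made-up return addresses, and compares what comes back after 8 and after 9 nested calls.

```javascript
// Minimal model of an 8-level circular hardware return stack.
class ReturnStack {
  constructor(depth = 8) {
    this.slots = new Array(depth).fill(null);
    this.pointer = 0;
  }

  // CALL or interrupt entry: store the return address, wrapping around.
  push(address) {
    this.slots[this.pointer % this.slots.length] = address;
    this.pointer += 1;
  }

  // RETURN: step back one slot and read whatever is stored there.
  pop() {
    this.pointer -= 1;
    const depth = this.slots.length;
    return this.slots[((this.pointer % depth) + depth) % depth];
  }
}

// Nest callDepth calls, then unwind them all again.
function simulate(callDepth) {
  const stack = new ReturnStack();
  const pushed = [];
  for (let level = 1; level <= callDepth; level += 1) {
    const address = 0x100 + level; // hypothetical return addresses
    stack.push(address);
    pushed.push(address);
  }
  const actual = [];
  for (let level = 0; level < callDepth; level += 1) {
    actual.push(stack.pop());
  }
  return { expected: pushed.reverse(), actual };
}

const hex = (list) => list.map((a) => a.toString(16)).join(' ');
const nine = simulate(9);
console.log('expected:', hex(nine.expected));
console.log('actual:  ', hex(nine.actual));
```

With 8 levels the two lists are identical. With 9 levels the last return picks up the address stored by the ninth call instead of the one stored by the first, so the outermost subroutine "returns" into the wrong place.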

Because of the very special set of circumstances needed for this to happen, the CS command will succeed 9999 times out of 10000. But that other time it fails, leading to a watchdog reset.
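Taking that ratio literally, which is an assumption since it reads more like a figure of speech than a measured rate, a quick calculation shows why a shorter send interval would make the reset turn up sooner. `expectedDaysToFailure` is a made-up helper; it uses the fact that an event with a 1-in-N chance per attempt takes N attempts on average.

```javascript
const SECONDS_PER_DAY = 24 * 60 * 60;

// Average time until the first failure when one command in every
// oneInN fails and a command is sent every intervalSeconds.
function expectedDaysToFailure(intervalSeconds, oneInN) {
  return (intervalSeconds * oneInN) / SECONDS_PER_DAY;
}

console.log(expectedDaysToFailure(50, 10000).toFixed(1)); // one CS every 50 s
console.log(expectedDaysToFailure(5, 10000).toFixed(1));  // one CS every 5 s
```

With these illustrative numbers, the 50-second interval gives an average of almost 6 days between failures, and the 5-second interval about 14 hours. The real odds depend on interrupt timing, so only the trend is meaningful, not the exact values.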

I have made some changes to reduce the number of nested subroutines by 2. I'm currently testing to make sure I didn't break anything important. Expect firmware 5.3 to be available soon. Thanks for helping me figure this out. I couldn't have done it without your log.
Schelte
sygys
Starting Member
Posts: 14
Joined: Mon Dec 27, 2021 4:28 pm

Re: OTGW system error

Post by sygys »

That is very good news! I'm looking forward to installing the new release.

Is there another way, in my case, to stop this from happening? I hear you say "Parsing commands that take a floating point value is quite tricky". Am I doing something wrong here that is causing this? And can I maybe parse the commands another way?
hvxl
Senior Member
Posts: 1959
Joined: Sat Jun 05, 2010 11:59 am

Re: OTGW system error

Post by hvxl »

You are not doing anything wrong. I was just trying to explain the situation. The CS command is defined to accept a floating point value. The OTGW firmware needs to parse that value. In assembly, that takes some work.
Schelte