Bot Early Warning System

Like everyone, I've got an app that's running slow, primarily because a particular table now has 35,000 rows and 8 virtual columns. Most of these don't actually need to show 'live' data, so I'm going to change them over to a nightly Bot that calculates these figures and then writes them into a spreadsheet column.

But it got me thinking. Once bots hit a certain duration, they time out, but we get no notification of that. So if I do the above, the Bot could be timing out every day, my client would be working off old data, and I wouldn't know.

So the question here is - does anyone use a solution that is better than my proposed one below?

@MultiTech @Steve @WillowMobileSys @dbaum @Aurelien @SkrOYC @Suvrutt_Gurjar @Kirk_Masden 

We can't get a bot to do something when it fails.  But we can if it succeeds by having the last step do something.  So my idea is to create a table called EWS with a row for each bot and columns of:

[Bot Name]   [Run Frequency]   [Run Time]   [Last Run]

So whenever a bot runs, it writes NOW() to EWS[Last Run], and we have a separate bot that checks the EWS table for anything that hasn't run. If it finds anything that hasn't run at the required time, it sends me an email.
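As a minimal sketch, the watchdog bot's condition could look like the expression below - assuming EWS[Last Run] is a DateTime and EWS[Run Frequency] is stored as a Duration (those types are an assumption, not something stated above):

```
AND(
  ISNOTBLANK([Last Run]),
  NOW() > [Last Run] + [Run Frequency] + "001:00:00"
)
```

The trailing "001:00:00" is a one-hour grace period, so a run that finishes slightly late doesn't trigger a false alarm.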

Does anyone have anything more elegant?


I literally had this exact scenario happen to one of my clients, who runs a telehealth service:

  • Several of the tables had grown in size, causing the VCs (virtual columns) to become a source of slowdown
  • I moved the VCs into physical columns, creating a bot that runs on a daily schedule to update these values
  • The bot originally worked, but then failed for some reason, with no notification that it had failed
  • It wasn't until someone reached out to me, noting that a new patient they had just entered into the system wasn't showing up in a list, that I looked into it
  • When I did some digging, I found out that the bot had been failing for weeks
  • Literally for weeks on end, everyone had been operating on the assumption that the data they were using was current and up-to-date
  • When in reality everything was weeks old

My solution was to reconfigure the bot so that it wouldn't fail; in particular this was a timeout issue, and my solution was twofold.

  • First I did some investigation into why things were taking so long, and made some adjustments to reduce the size of the lists involved
  • I then split my automation into two parts (which is technically now three), each one handling its own portion of the processing load

By splitting things up, and reducing the size of what the system is working with, I was able to give myself at least another couple of years' worth of breathing space before I'd have to possibly rethink things again.

---------------------------------------------

  • I like the idea of creating a sort of log type of system yourself
  • Sure, it may add a lot of extra records to the system...
  • But I'd rather have peace of mind

If you create a record in your log table as the first step of your automation,

then go about doing the things that you're doing,

and then the last step of the automation DELETE the log record...

  • You would end up with a table whose records correspond to all of the automations that failed
  • If the automation had finished, it would have deleted the log record - but since it failed, its log record is still in the table

Doing things this way will keep your log table relatively small

You just have to install a script that runs on a schedule to remove the blank rows inside the sheet, because those are going to pile up pretty quickly.

But once you have all of this set up and in place and running, getting a report of all of the errors becomes incredibly easy.

All you'd have to do is create a secondary automation, set to run at a certain amount of time after the first, that looks for any records inside the log table. If there's anything inside there, it can send you those in an email, thus letting you know what failed.
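As a sketch, that secondary automation's condition could simply check whether any log rows are old enough to count as failures - assuming a log table named "Bot Log" with a DateTime column [Created At] (both names hypothetical):

```
ISNOTBLANK(
  FILTER("Bot Log", [Created At] < NOW() - "000:30:00")
)
```

Any rows this FILTER() returns started more than 30 minutes ago and never deleted their own log record - i.e., presumed failures.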

I kind of like this system 🤔🤔🤔 I'm going to have to go back and implement this in a few of my apps

Hi @MultiTech 


@MultiTech wrote:

and then the last step of the automation DELETE the log record...


How is this implemented? I have added a new row to the log table when a bot starts. Then, when the bot ends, how can I delete the same row - more specifically, how can I get the key of the row created in the earlier step?

 

Hi @dbaum Thanks for this. It's a wonderful explanation of a topic that is not clearly explained even in the AppSheet documentation. But I am still struggling to implement it. Can you please guide me with respect to this requirement: create a row in the LOG table in one step of a bot on another table, then delete the same row of the LOG table in the last step of the bot (getting the key of the row created in the earlier step)?

Sure. Following is an outline of how I would approach this using "Call a process" and "Return values" steps.

1. Call a process step in main process

dbaum_0-1676090929253.png

2. Return values step in called process

dbaum_2-1676091268330.png

 

3. Delete row step in main process

dbaum_1-1676091140967.png

Nonetheless, that's probably not the only viable approach. For instance, in at least some scenarios (i.e., data change automations, but maybe not all scheduled automations) it should be possible to designate the new log row's key yourself rather than using a value generated by UNIQUEID--e.g., it could be the same key value as the key of the row that triggered the automation, or maybe you'd want to use that key value appended with something like "-log". In this case, of course, you could just use the "Add new rows" data change action rather than "Call a process" and "Return values"--and, no need for a secondary process. 
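To illustrate the deterministic-key variant: in the "Add new rows" step, the log row's key could be computed from the triggering row's key, so the final delete step can reconstruct the same key without needing "Return values". Here [ID] stands in for the triggering table's key column - a hypothetical name:

```
CONCATENATE([_THISROW].[ID], "-log")
```

The delete step at the end of the process would then look up the log row using the same expression.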

dbaum_3-1676091522643.png

 

@dbaum Thank you so much 🤗 I have implemented it. This "Call a process" and "Return values" business was something that was over my head until your explanation. Thanks once again.

@dbaum I made a small modification. Instead of deleting the row, I made a BOT Status column which is set to Started at the first step of the bot and set to Completed at the last step. If the status is not Completed within 5 minutes of starting, the status is changed to Error with the help of another bot. I'm explaining below how I implemented it, in case someone finds it useful. Please share your thoughts/comments @dbaum @1minManager @WillowMobileSys @Aurelien @SkrOYC

STEP 1 ---> Created a BOT MONITOR Table with the Columns below

Table.png

STEP 2 ---> Created a Process, BOT Start under BOT MONITOR Table 

BOT Start.png

STEP 3 ---> Created a Process, BOT End under BOT MONITOR Table

BOT End.png


STEP 4 ---> Added a First Step in all the BOTs to be monitored as START

Start.png

STEP 5 ---> Added a Last Step in all the BOTs to be monitored as END

End.png

STEP 6 ---> Created another BOT under the BOT MONITOR Table, which triggers on any ADD to the table and waits 5 minutes, then checks the BOT Status. If it's not Completed, it changes the BOT Status to Error and sends a notification in Pushover.

STEP 7 ---> Created another scheduled BOT under the BOT MONITOR Table to delete all Completed logs after 7 days.
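As a sketch, the step 6 check that runs after the 5-minute wait could use a condition like the one below - assuming a DateTime column [Start Time] on the BOT MONITOR table (a hypothetical name):

```
AND(
  [BOT Status] <> "Completed",
  NOW() >= [Start Time] + "000:05:00"
)
```

The step 7 cleanup bot could select rows to delete with a similar condition, combining [BOT Status] = "Completed" with [Start Time] < NOW() - "168:00:00" (168 hours = 7 days).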

BOT Monitor in App

1.png

Detail View

2.png


Very robust! Thanks for sharing. Worth posting as a tip for better discoverability.

Thank you @dbaum 😊 Will do that 👍


@MultiTech wrote:

If you create a record in your log table as the first step of your automation,

then go about doing the things that you're doing,

and then the last step of the automation DELETE the log record...


This is a really good point...


@1minManager wrote:

Most of these actually don't need to show 'live' data so I'm going to change them over to a nightly Bot that calculates and then writes these figures into a spreadsheet column


Yes, I did this myself for the same reasons. No need for a VC but at the same time we need some kind of updated data now and then, just not a complete recalculation every sync.


@1minManager wrote:

Once bots hit a certain duration, they time out.  But we get no notification of that


YES! Recently I had this problem, but fortunately I noticed it myself while doing a "Run" test of my scheduled bot. I spent a couple of days trying to understand why it was not working, because there was no indication in the monitor app. Finally I found it in the log analyzer: it was timing out after some time.


@1minManager wrote:

does anyone use a solution that is better than my proposed one below?


I haven't implemented it yet, but my plan is to make bots that are data driven instead of scheduled and run actions just on the related rows, plus some actions inside user interaction. You could separate this into categories:

  1. Bots for 20+ rows touched/updated.
    • So that when there is an add/update/delete, a bot triggers an update to the tables/rows that would need a recalculation in the dataset.
  2. Actions for fewer than 20 rows touched/updated
    • If the number of rows that need updating is low, or it's even the same record the user is entering, trigger a grouped action to update the record just before the user gets to the detail view, and update other rows if needed via "Execute an action on a set of rows"; or
    • If the change is what triggers the update, attach the action to form save, again using an "Execute an action..." just on the "affected" tables/rows
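For the "Execute an action on a set of rows" part, the Referenced Rows expression is where you scope the recalculation to just the affected rows. A sketch, with a purely illustrative table and column:

```
FILTER("Orders", [Customer ID] = [_THISROW].[Customer ID])
```

This returns only the sibling rows that actually need recomputing, instead of touching the whole table.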

@1minManager wrote:

We can't get a bot to do something when it fails.  But we can if it succeeds by having the last step do something.  So my idea is to create a table called EWS with a row for each bot and columns of:

[Bot Name]   [Run Frequency]   [Run Time]   [Last Run]

So if a bot runs it writes Now() to EWS[Last Run] and we have a seperate bot that checks the EWS table for anything thats not run.  If it finds anything that's not run at the required time, it sends me an email.

Does anyone have anything more elegant?


This looks neat; I'll think about this in the future to get a much better log system. As we already discussed, it's a shame that we are not notified about these failures on Core. Your solution looks great.


Btw, the way I dealt with this problem of a scheduled bot failing was by reducing the number of rows, just that.
How useful this is will vary. In my case I just made sure the bot only runs on rows less than six months old; any more than that and it fails. That's OK for my use case because older records are just left as historical data in some sense.

Also important to know: changing from virtual columns to "real" ones sounds as easy as just using the App Formula on a real column, but consider whether you really need that, or whether Initial Value with Reset on edit would be enough. Since you can write an expression to decide whether to reset the field or not, this can reduce the processing done by the server when updating a row, because fewer expressions/fields are involved in the whole row update.
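As an illustration of that idea: the old App Formula expression moves into Initial value, and Reset on edit gets a condition so the field only recalculates while it still matters. The [Status] column below is purely hypothetical - the point is that Reset on edit accepts any Yes/No expression:

```
[Status] <> "Closed"
```

With this in Reset on edit, edits to closed records skip the recalculation entirely.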

Extra tip just in case someone doesn't know:
If the AppSheet table column names are the same as the worksheet ones, a Regenerate structure shouldn't break anything, keeping everything intact. Now, if you add a new column to your worksheet with the same name as a virtual column, AppSheet keeps the column config in place, but it's now a real column. The bad news is that the data needs to be populated anyway.


Extra question: what's the method you are using to update your now-real columns? A bot that uses an action to update one field, with the others then evaluated via their App Formulas? Or a bot that uses an action to update all now-real column fields with their respective formulas inside the action?


@SkrOYC wrote:

if you add a new column to your worksheet with the same name as a virtual column, it keeps the column config in place but it's now a real column


This indeed is a great technique and has helped me many times. More details for anyone interested: Convert a virtual column into a real column--or the reverse


@MultiTech wrote:

You just have to install a script to run on a schedule to remove any of the blank rows inside the sheet, because that's going to grow pretty large pretty quick


In my experience using MS Excel, if AppSheet deletes a row (or removes the content - as we all know, it doesn't delete the actual spreadsheet row) and the blank space is at the bottom of the spreadsheet, new data is added in that blank space. In other words, AppSheet adds content just after the last row with data. So it shouldn't be a problem unless a bot deletes data that was added before another bot has already added more.

Idk if this translates to the way GSheets works

Google Sheets operates the same way: it will overwrite blank rows at the bottom, but not blank rows above any data.

I think your proposed solution is good.  But do you have an Enterprise plan?   

It is my understanding that those on Enterprise can configure the plan to send emailed notifications of errors.  Bot timeout is an error that is logged - at least in the Monitor panel - I would think these are emailed as well??

If you don't have an Enterprise plan then implementing your Early Warning System is a great idea to help keep costs down but still get notification of potential issues.

However, I would like to offer this observation.  35,000 rows is really not that many rows and should be able to be processed with the timeout allotment AppSheet provides.  Design is key.   You may want to consider redesigning the process because if it is slow now, it will only get worse over time as the data size grows.  It will not be sustainable. 

There ARE ways to improve the processing time - even in Appsheet.  If you would like to pursue that avenue, let us know.  I will also offer to help off platform if that is anything you would be interested in.


@WillowMobileSys wrote:

Bot timeout is an error that is logged - at least in the Monitor panel


It's under the "Log analyzer" - just not something you see unless you 1) notice your bot is not working and 2) go to that specific page.


@WillowMobileSys wrote:

35,000 rows is really not that many rows and should be able to be processed with the timeout allotment AppSheet provides


I wasn't able to do this with more than a few hundred rows; it seems to be tied to the amount of computation you expect for each row. I have around 14 App Formulas that need to be computed, just one of them involving a FILTER() inside a SELECT() against a table with fewer than 680 rows, and my bot timed out anyway. I guess Enterprise users can get more computation power on the server as well; that may also be a factor in my case.
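For what it's worth, a nested FILTER() inside a SELECT() can often be flattened into a single SELECT() with a combined AND() condition, giving the server one pass over the table instead of two. A sketch with purely illustrative names:

```
SELECT(
  Orders[Amount],
  AND(
    [Status] = "Open",
    [Customer ID] = [_THISROW].[Customer ID]
  )
)
```

Whether this helps depends on what the inner FILTER() was doing, but it's usually the first thing worth trying before splitting the bot.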

You forgot the "Design is key." part!

My point was that an AppSheet app can handle that many rows - even in non-Enterprise apps.  The process needs to be properly designed and implemented within the confines of the platform.   

In other words, if the calculation is not intensive, we shouldn't worry that much about the number of rows?

No, not exactly.  To start with, without experience, YES build the easiest solution.  But then run it and evaluate the run time.  If the process does run too long, then it's time to be crafty!   We then have to understand the constraints and build around them. 

To be honest, I would do (and have done) exactly what @1minManager has done.  Build the process first and then react to any issues.  That is because I, like most of us, don't yet have enough aggregate experience to know ahead of time where the issues might be.  I think we are all learning something valuable with this thread as we build larger more demanding apps in that we need to consider run time length for any suspected long running processes.

I do want to point out that the issue highlighted here is not new.  Even extremely large systems have this consideration about long running processes - even more so.  They have learned over time to just plan for post-processing capabilities with any new implementation

It may be they decide that running a process overnight is just fine, so start the process and let it rip.  BUT they have the luxury of allowing that process to run to completion no matter how long with no constraints.  

Usually, though, large companies DO have end-of-day processing and need these processes to run within a window of time.  On many occasions those processes need to be cleverly crafted so that much of the work is done throughout the day so that when it comes time to finalize results, those end-of-day processes do not need to do as much work- i.e. are faster.  Then, typically, these end-of-day results feed into a reporting system for daily reports and into the end-of-month results...and so on.


@SkrOYC wrote:

Bad news is that the data needs to be populated anyway


What's the method you are using to update your now-real columns?


What I do for this is a multi-staged process:

  1. Cut the formula from the App Formula space, removing it from the column's settings
      - You'll need to save so this change propagates through the system
  2. Create a data change action that sets the value of the column, using the formula you cut
  3. Create a ref update action, to run your data change action on the rows you need to update
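For step 3, the ref action's Referenced Rows expression just needs to return every row that still needs its value written - for example, all rows where the new real column is still blank (table and column names below are hypothetical):

```
FILTER("MyTable", ISBLANK([Computed Total]))
```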

From here, you could simply push the button and it would run the data change action on all the rows; but you'd have to wait for all the syncs to upload, since these data changes would be made client-side.

Instead, what I do is create a scheduled bot and use the "RUN" button to run my action. This accomplishes essentially the same thing, but it's done through automation (using parallel processing) and is really fast.

  • Create a scheduled bot
  • Event details
    • Its event name doesn't matter
    • It needs to be scheduled
        - the specifics of the scheduling don't matter
    • "For each row" needs to be true

  • Task details
    • The name of the task doesn't matter
    • Task type:  data change
    • Specific action = the action you made previously

  • Save, so it propagates
  • Tap the run button
  • Once the bot is finished, you can easily check the data source to see if it's complete; make sure to DISABLE the bot
      - this way it's not running on that schedule you didn't bother setting any details for. 😉

-----------------------------------------------------------------------------------

I actually did something like this on last Friday's live stream; only in this instance I was adding new columns, with new data, to the source table and copying them to the target table.

  • But the mechanics of what you need to do (the actions + automation setup) is the same

Here's a timestamp to the point in the stream when I begin this process.

Aurelien
Google Developer Expert

@1minManager @SkrOYC @WillowMobileSys

All of these ideas are awesome, thank you for sharing.


@1minManager wrote:

Once bots hit a certain duration, they time out.  But we get no notification of that. 


Fortunately I haven't run into this kind of issue yet, though I have noticed that a bot running an Apps Script that fails won't return any error, at least in my experience.

Because of that, I decided to run the Apps Script function directly from the script's own scheduler, not an AppSheet bot.

@Roderick you may want to inform the dev team of these issues for possible improvement? Maybe it has been done already.


@WillowMobileSys wrote:

It is my understanding that those on Enterprise can configure the plan to send emailed notifications of errors.  Bot timeout is an error that is logged - at least in the Monitor panel - I would think these are emailed as well??


@lizlynch Can you bring any clarification on this point? I did a quick search in the documentation but didn't find anything.

 

In my humble opinion, such a timeout error alert should be a default setting - or at least a minified log with the last error timestamp, maybe in the app editor, like what we have under Manage > Monitor > Usage Statistics.
