Solved: Multirows Functions and Expressions- Processing Se...

Suvrutt_Gurjar · 02-17-2021 07:38 AM

We all try to optimize the efficiency of multi row functions /expressions such as SELECT() and other derivatives of SELECT() , MAXROW(), MINROW(), FILTER(), LOOKUP() etc.

May we know how the expression processing in actual table takes place when AppSheet processes it. Is it evaluated sequentially row-wise to see if the conditions are met?

As an example if an expression is say SELECT( Orders[Order ID], AND( [Order Statue]= “Open”, [Order Year]= 2016))

Will the condition for the year and status be evaluated sequentially from row 2 to the last row of the table?
If the condition is evaluated sequentially for rows, then if the app creator knows that say all year 2016 records are before row number 500, then will putting a condition such as say [_ROWNUMBER}<=500 will help in the SELECT() function ignoring remaining rows of the table in processing?

In summary, if there is any sequence that the multirow expressions follow in processing, is it possible that it can be made use of in improving sync time optimization?

Steve

I can only speculate.

As I understand it, the AND() and OR() functions do NOT short-circuit, so the order of the subexpressions has ZERO effect on performance. In fact, using AND() and OR() would therefore always deliver worst-case performance.

Conversely (as I understand it), IF(), IFS(), and SWITCH() can be used to short-circuit evaluation, and therefore COULD have a performance benefit.

Therefore this:

AND(
  expr1,
  expr2,
  ...,
  exprN
)

might be better expressed as:

IFS(
  NOT( expr1 ), FALSE,
  NOT( expr2 ), FALSE,
  ...,
  NOT( exprN ), FALSE,
  TRUE, TRUE
)

And this:

OR(
  expr1,
  expr2,
  ...,
  exprN
)

might be better expressed as:

IFS(
  expr1, TRUE,
  expr2, TRUE,
  ...,
  exprN, TRUE,
  TRUE, FALSE
)

If my understanding is correct, then there is a benefit to be gained by writing the expression like this:

SELECT(
  Orders[Order ID],
  IFS(
    NOT([Order Statue] = “Open”), FALSE,
    NOT([Order Year] = 2016), FALSE,
    TRUE, TRUE
  )
)

With the benefit growing with the number of rows in the table.

In this simple case, I wouldn’t think so.

The BIG wins come when your subexpressions include queries themselves, because a short-circuit before the query avoids the query entirely. Consider this:

OR(
  ISBLANK(value),
  (value = LOOKUP(...)
)

With OR(), BOTH expressions are evaluated, even if the ISBLANK() subexpression is TRUE. If ISBLANK(...) is true, the OR() will be TRUE regardless the result of the LOOKUP() subexpression, so there’s no win in doing it, but a loss in the performance hit.

Alternatively with IFS():

IFS(
  ISBLANK(value), TRUE,
  (value = LOOKUP(...), TRUE,
  TRUE, FALSE
)

the LOOKUP() isn’t performed at all if the ISBLANK() is TRUE.

View solution in original post

Steve

I can only speculate.

As I understand it, the AND() and OR() functions do NOT short-circuit, so the order of the subexpressions has ZERO effect on performance. In fact, using AND() and OR() would therefore always deliver worst-case performance.

Conversely (as I understand it), IF(), IFS(), and SWITCH() can be used to short-circuit evaluation, and therefore COULD have a performance benefit.

Therefore this:

AND(
  expr1,
  expr2,
  ...,
  exprN
)

might be better expressed as:

IFS(
  NOT( expr1 ), FALSE,
  NOT( expr2 ), FALSE,
  ...,
  NOT( exprN ), FALSE,
  TRUE, TRUE
)

And this:

OR(
  expr1,
  expr2,
  ...,
  exprN
)

might be better expressed as:

IFS(
  expr1, TRUE,
  expr2, TRUE,
  ...,
  exprN, TRUE,
  TRUE, FALSE
)

If my understanding is correct, then there is a benefit to be gained by writing the expression like this:

SELECT(
  Orders[Order ID],
  IFS(
    NOT([Order Statue] = “Open”), FALSE,
    NOT([Order Year] = 2016), FALSE,
    TRUE, TRUE
  )
)

With the benefit growing with the number of rows in the table.

In this simple case, I wouldn’t think so.

The BIG wins come when your subexpressions include queries themselves, because a short-circuit before the query avoids the query entirely. Consider this:

OR(
  ISBLANK(value),
  (value = LOOKUP(...)
)

With OR(), BOTH expressions are evaluated, even if the ISBLANK() subexpression is TRUE. If ISBLANK(...) is true, the OR() will be TRUE regardless the result of the LOOKUP() subexpression, so there’s no win in doing it, but a loss in the performance hit.

Alternatively with IFS():

IFS(
  ISBLANK(value), TRUE,
  (value = LOOKUP(...), TRUE,
  TRUE, FALSE
)

the LOOKUP() isn’t performed at all if the ISBLANK() is TRUE.

Suvrutt_Gurjar

Thank you very much @Steve for your very insightful response. I appreciate. This really helps.

I will try to run some tests in the coming days based on your guidelines and share my observations if any.

Steve

I would be very interested in those results!!!

Suvrutt_Gurjar

Sure @Steve.

Suvrutt_Gurjar

Hi @Steve ,

Following are some observations.

Test Data Set: Public Data Set of World Cup Soccer games. Total Rows 37784 ( Downloaded from Kaggle.com)
Table’s Data Columns:

Nomenclature for VCs

The VC with the following format expression is called "AND Simple VC "

The expression used in the VC
SELECT(SWCPlayers[Player Name],
AND([Coach Name]=USERSETTINGS(Coach Name), [MatchID]=USERSETTINGS(Match ID)))

The VC with the following format expression is called " AND IFS VC "

The expression used in the VC

SELECT(SWCPlayers[Player Name],

IFS( NOT([Coach Name]=USERSETTINGS(Coach Name)), FALSE,
NOT(USERSETTINGS(Match ID)=[MatchID]), FALSE,
TRUE, TRUE))

The VC with the following format expression is called "OR Simple VC "

The expression used in the VC
SELECT(SWCPlayers[Player Name], OR([Coach Name]=USERSETTINGS(Coach Name), [MatchID]=USERSETTINGS(Match ID)))

The VC with the following format expression is called " OR IFS VC "

The expression used in the VC
SELECT(SWCPlayers[Player Name], IFS([Coach Name]=USERSETTINGS(Coach Name), TRUE, [MatchID]=USERSETTINGS(Match ID), TRUE, TRUE, FALSE))

The execution times were as in the table below. In each of the records, the sync was performed on the entire set of 38K rows. The column “SELECT Arguments’ Approximate Rows Position” denotes where most of the satisfying conditions for arguments were likely to be there for that execution.

Observations

The AND() condition expressions were taking lesser time than OR() conditions for the same arguments.
Maybe I am missing some obvious points, however, IFS() expressions in most cases were taking almost the same or even more time than simple AND or OR Conditions. Only those cells, highlighted in light green had IFS() expressions run faster than simple AND()
The cells highlighted in light pink were taking lesser time, even though the rows satisfying the arguments mostly were at a higher row count.

i will be willing to test more, if you have suggestions. I might have missed some obvious point as well.

Steve

Great stuff!

Your results actually suggest to me that short-circuit evaluation has been added to AND() and OR(), which would be great news!

Thanks so much for doing this performance testing!

Suvrutt_Gurjar

You are welcome and thank you @Steve

I will be always glad to be of help in any such testing.

Yes, the AND() and OR() seems to be taking into account short-circuiting.

Another observation, even though not very conclusive and maybe even incorrect, was that if the records satisfying conditions of SELECT() are at the beginning of the table, the execution time tended to be lesser. Of course, there were many exceptions to this observation.

Koichi_Tsuji

Thanks you @Suvrutt_Gurjar and @Steve for this interesting discussions.
Short-circuit was the word I first time to hear and learn, yes it should logically affect the performance of appsheet expression.

To add something here to this dicussion, I suppose the type of data base should be one of the factor to be in mind. This is not a story of expression itself thoguh.

For instance, SQL.

When we add INDEX to the columns which will be queried from Appsheet side (my assumption is appsheet expression will be converted to sql syntax to send query to SQL?), the performance is said to be improve, as query performance is naturally getting better and more efficient.
Wondering if Google sheet could implement like indexing kinds of things, then our app built on sheet will perform better?

Suvrutt_Gurjar

Hi @tsuji_koichi ,

Thank you for joining the discussion. Your insights are always valuable. Your suggestion on databases and indexes is very valid.

Sync performance is indeed a multifaceted subject - databases/spreadsheet to the data provider and even network speed to optimally written expressions.

Started this particular thread to request guidance on how multi-row expressions can be more optimally written if we know the row-wise processing sequence of those expressions.

I will request your insights. I am also performing some more tests, purely on multirow expressions, slices, etc., and will share useful observations, if any.

Koichi_Tsuji

Majorities of Appsheet app could be using Google Spreadsheet.
Not sure how it is going to be difficult and challenging to introdue “apply indexing” to the selected column within google sheet, but if we could, it is going to improve appsheet app performance in general?
This should be subject for backend engineers … I have nil knowledge.

In the past, i migrated app with large data set from google sheet to ms sql and got advice from appsheet team to apply index to fields used for query, and it evenually improved the performance. Wondering it can be done with googlesheet…

@praveen

Multirows Functions and Expressions- Processing Sequence and Sync Time Optimization