Extract some text from List

I use OCRTEXT() to read an Image and get this result:

SomeText, 6BT6613, SA 415+416, 1513.56.338, 1513.56.350, 1513.56.351, SomeText, 1513.51.310, SA 372, 1513.56.358, 1513.56.344, 330 mm, 1513.56.360, 1513.56.424, 1513.56.426, 1513.56.425, Another Text: 1999, HelloWorld,

This can easily be converted to a LIST, because it has the comma as a Item separator.
(This is, when your OCRTEXT() column is with format TEXT. LONGTEXT would not produce comma separators.)

My goal is, to have a LIST with only the numbers that start with 1513. Like:

  1. 1513.56.338
  2. 1513.56.350
  3. 1513.56.351
  4. 1513.51.310
    and so on.

So I have to delete the Text values before, after and in between.
Important: the sort order must stay the same.

Does anyone has a hint?

One workaround I’m thinking of is Multiple SUBSTITUTE().
–> Substitute every text with β€œβ€

But the only expression I found is:

SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE([OCRTEXT],"a",""),"b",""),"c",""),"d",""),"e",""),"f",""),"g",""),"h",""),"i",""),"j",""),"k",""),"l",""),"m",""),"n",""),"o",""),"p",""),"q",""),"r",""),"s",""),"t",""),"u",""),"v",""),"w",""),"x",""),"y",""),"z",""),"A",""),"B",""),"C",""),"D",""),"E",""),"F",""),"G",""),"H",""),"I",""),"J",""),"K",""),"L",""),"M",""),"N",""),"O",""),"P",""),"Q",""),"R",""),"S",""),"T",""),"U",""),"V",""),"W",""),"X",""),"Y",""),"Z",""),"Γ€",""),"ΓΆ",""),"ΓΌ",""),"Γ„",""),"Γ–",""),"Ü",""),"ß",""),":",""),"-",""),"!",""),"+","")

Is there a more easy way?

I found a solution.
With this I tell AppSheet to change the Item separator from β€œ,” to β€œ#”, but only for the 1513.00.000 Numbers.
After that I use INDEX() to add one by one and to truncate each Number to 12 digits.
I start with INDEX(…,2) because the first one is unwanted Text.
I go up till INDEX(…,20) because I think there will be not more than that.
Finally I delete empty entries with -LIST("").
This is the result:

1513.56.338, 1513.56.350, 1513.56.351, 1513.51.310, 1513.56.358, 1513.56.344, 1513.56.360, 1513.56.424, 1513.56.426, 1513.56.425

And this is my expression:

LIST(
LEFT(
INDEX(
SPLIT(
SUBSTITUTE([OCRTEXT],", 1513.","# 1513.")
,"#")
,2)
,12)
)
+
LIST(
LEFT(
INDEX(
SPLIT(
SUBSTITUTE([OCRTEXT],", 1513.","# 1513.")
,"#")
,3)
,12)
)
+
LIST(
LEFT(
INDEX(
SPLIT(
SUBSTITUTE([OCRTEXT],", 1513.","# 1513.")
,"#")
,4)
,12)
)
+
LIST(
LEFT(
INDEX(
SPLIT(
SUBSTITUTE([OCRTEXT],", 1513.","# 1513.")
,"#")
,5)
,12)
)
+
LIST(
LEFT(
INDEX(
SPLIT(
SUBSTITUTE([OCRTEXT],", 1513.","# 1513.")
,"#")
,6)
,12)
)
+
LIST(
LEFT(
INDEX(
SPLIT(
SUBSTITUTE([OCRTEXT],", 1513.","# 1513.")
,"#")
,7)
,12)
)
+
LIST(
LEFT(
INDEX(
SPLIT(
SUBSTITUTE([OCRTEXT],", 1513.","# 1513.")
,"#")
,8)
,12)
)
+
LIST(
LEFT(
INDEX(
SPLIT(
SUBSTITUTE([OCRTEXT],", 1513.","# 1513.")
,"#")
,9)
,12)
)
+
LIST(
LEFT(
INDEX(
SPLIT(
SUBSTITUTE([OCRTEXT],", 1513.","# 1513.")
,"#")
,10)
,12)
)
+
LIST(
LEFT(
INDEX(
SPLIT(
SUBSTITUTE([OCRTEXT],", 1513.","# 1513.")
,"#")
,11)
,12)
)
+
LIST(
LEFT(
INDEX(
SPLIT(
SUBSTITUTE([OCRTEXT],", 1513.","# 1513.")
,"#")
,12)
,12)
)
+
LIST(
LEFT(
INDEX(
SPLIT(
SUBSTITUTE([OCRTEXT],", 1513.","# 1513.")
,"#")
,13)
,12)
)
+
LIST(
LEFT(
INDEX(
SPLIT(
SUBSTITUTE([OCRTEXT],", 1513.","# 1513.")
,"#")
,14)
,12)
)
+
LIST(
LEFT(
INDEX(
SPLIT(
SUBSTITUTE([OCRTEXT],", 1513.","# 1513.")
,"#")
,15)
,12)
)
+
LIST(
LEFT(
INDEX(
SPLIT(
SUBSTITUTE([OCRTEXT],", 1513.","# 1513.")
,"#")
,16)
,12)
)
+
LIST(
LEFT(
INDEX(
SPLIT(
SUBSTITUTE([OCRTEXT],", 1513.","# 1513.")
,"#")
,17)
,12)
)
+
LIST(
LEFT(
INDEX(
SPLIT(
SUBSTITUTE([OCRTEXT],", 1513.","# 1513.")
,"#")
,18)
,12)
)
+
LIST(
LEFT(
INDEX(
SPLIT(
SUBSTITUTE([OCRTEXT],", 1513.","# 1513.")
,"#")
,19)
,12)
)
+
LIST(
LEFT(
INDEX(
SPLIT(
SUBSTITUTE([OCRTEXT],", 1513.","# 1513.")
,"#")
,20)
,12)
)
-
LIST("")
2 Likes

In situations like this one, you normally can sort the list of text which would place all your similar and desired text together. You would then find starting and/or ending locations and perform some simple match based on the total text length to extract your information from the sorted list.

For example, in your case you could NORMALLY apply a function like this:

MID(
         SORT( SPLIT([OCR Text], ",")),
         FIND("1513", SORT( SPLIT([OCR Text], ","))),
         LEN([OCR Text]) 
                  -  FIND("1513", SORT( SPLIT([OCR Text], ","))) 
                  - FIND("1513", SORT( SPLIT([OCR Text], ","), true))
                  + 2 
)

This would get your text and then you can dice it up into lists, etc.


HOWEVER, I am finding an issue with sorting that prevents successfully using the above expression.

For example, this text

 **`SomeText, 123.124, 1513.56.338, 6BT6613, SA 415+416'**

When using SPLIT() and then sorting ascending, the list order is:

 `6BT6613 , 123.124 , 1513.56.338 , SA415+416 , SomeText`

But when SPLIT() and sorted descending the list order is:

 `SomeText , SA415+416 , 6BT6613 , 1513.56.338 , 123.124`

Note the placement of the value β€œ6BT6613”

I know that when sorting non-homogenous data some unexpected things can happen. But in this case I can’t think of any reason why the sorting would be like this!!

2 Likes

Thank you John,
your solution looks great.
Sadly it can’t use it in my case, because the order of the numbers has to stay the same. So I can’t use any kind of SORT().