Extract some text from List

I use OCRTEXT() to read an Image and get this result:

SomeText, 6BT6613, SA 415+416, 1513.56.338, 1513.56.350, 1513.56.351, SomeText, 1513.51.310, SA 372, 1513.56.358, 1513.56.344, 330 mm, 1513.56.360, 1513.56.424, 1513.56.426, 1513.56.425, Another Text: 1999, HelloWorld,

This can easily be converted to a LIST, because it has the comma as a Item separator.
(This is, when your OCRTEXT() column is with format TEXT. LONGTEXT would not produce comma separators.)

My goal is, to have a LIST with only the numbers that start with 1513. Like:

  1. 1513.56.338
  2. 1513.56.350
  3. 1513.56.351
  4. 1513.51.310
    and so on.

So I have to delete the Text values before, after and in between.
Important: the sort order must stay the same.

Does anyone has a hint?

Solved Solved
0 4 968
1 ACCEPTED SOLUTION

I found a solution.
With this I tell AppSheet to change the Item separator from “,” to “#”, but only for the 1513.00.000 Numbers.
After that I use INDEX() to add one by one and to truncate each Number to 12 digits.
I start with INDEX(…,2) because the first one is unwanted Text.
I go up till INDEX(…,20) because I think there will be not more than that.
Finally I delete empty entries with -LIST("").
This is the result:

1513.56.338, 1513.56.350, 1513.56.351, 1513.51.310, 1513.56.358, 1513.56.344, 1513.56.360, 1513.56.424, 1513.56.426, 1513.56.425

And this is my expression:

LIST(
LEFT(
INDEX(
SPLIT(
SUBSTITUTE([OCRTEXT],", 1513.","# 1513.")
,"#")
,2)
,12)
)
+
LIST(
LEFT(
INDEX(
SPLIT(
SUBSTITUTE([OCRTEXT],", 1513.","# 1513.")
,"#")
,3)
,12)
)
+
LIST(
LEFT(
INDEX(
SPLIT(
SUBSTITUTE([OCRTEXT],", 1513.","# 1513.")
,"#")
,4)
,12)
)
+
LIST(
LEFT(
INDEX(
SPLIT(
SUBSTITUTE([OCRTEXT],", 1513.","# 1513.")
,"#")
,5)
,12)
)
+
LIST(
LEFT(
INDEX(
SPLIT(
SUBSTITUTE([OCRTEXT],", 1513.","# 1513.")
,"#")
,6)
,12)
)
+
LIST(
LEFT(
INDEX(
SPLIT(
SUBSTITUTE([OCRTEXT],", 1513.","# 1513.")
,"#")
,7)
,12)
)
+
LIST(
LEFT(
INDEX(
SPLIT(
SUBSTITUTE([OCRTEXT],", 1513.","# 1513.")
,"#")
,8)
,12)
)
+
LIST(
LEFT(
INDEX(
SPLIT(
SUBSTITUTE([OCRTEXT],", 1513.","# 1513.")
,"#")
,9)
,12)
)
+
LIST(
LEFT(
INDEX(
SPLIT(
SUBSTITUTE([OCRTEXT],", 1513.","# 1513.")
,"#")
,10)
,12)
)
+
LIST(
LEFT(
INDEX(
SPLIT(
SUBSTITUTE([OCRTEXT],", 1513.","# 1513.")
,"#")
,11)
,12)
)
+
LIST(
LEFT(
INDEX(
SPLIT(
SUBSTITUTE([OCRTEXT],", 1513.","# 1513.")
,"#")
,12)
,12)
)
+
LIST(
LEFT(
INDEX(
SPLIT(
SUBSTITUTE([OCRTEXT],", 1513.","# 1513.")
,"#")
,13)
,12)
)
+
LIST(
LEFT(
INDEX(
SPLIT(
SUBSTITUTE([OCRTEXT],", 1513.","# 1513.")
,"#")
,14)
,12)
)
+
LIST(
LEFT(
INDEX(
SPLIT(
SUBSTITUTE([OCRTEXT],", 1513.","# 1513.")
,"#")
,15)
,12)
)
+
LIST(
LEFT(
INDEX(
SPLIT(
SUBSTITUTE([OCRTEXT],", 1513.","# 1513.")
,"#")
,16)
,12)
)
+
LIST(
LEFT(
INDEX(
SPLIT(
SUBSTITUTE([OCRTEXT],", 1513.","# 1513.")
,"#")
,17)
,12)
)
+
LIST(
LEFT(
INDEX(
SPLIT(
SUBSTITUTE([OCRTEXT],", 1513.","# 1513.")
,"#")
,18)
,12)
)
+
LIST(
LEFT(
INDEX(
SPLIT(
SUBSTITUTE([OCRTEXT],", 1513.","# 1513.")
,"#")
,19)
,12)
)
+
LIST(
LEFT(
INDEX(
SPLIT(
SUBSTITUTE([OCRTEXT],", 1513.","# 1513.")
,"#")
,20)
,12)
)
-
LIST("")

View solution in original post

4 REPLIES 4

One workaround I’m thinking of is Multiple SUBSTITUTE().
–> Substitute every text with “”

But the only expression I found is:

SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE([OCRTEXT],"a",""),"b",""),"c",""),"d",""),"e",""),"f",""),"g",""),"h",""),"i",""),"j",""),"k",""),"l",""),"m",""),"n",""),"o",""),"p",""),"q",""),"r",""),"s",""),"t",""),"u",""),"v",""),"w",""),"x",""),"y",""),"z",""),"A",""),"B",""),"C",""),"D",""),"E",""),"F",""),"G",""),"H",""),"I",""),"J",""),"K",""),"L",""),"M",""),"N",""),"O",""),"P",""),"Q",""),"R",""),"S",""),"T",""),"U",""),"V",""),"W",""),"X",""),"Y",""),"Z",""),"ä",""),"ö",""),"ü",""),"Ä",""),"Ö",""),"Ü",""),"ß",""),":",""),"-",""),"!",""),"+","")

Is there a more easy way?

I found a solution.
With this I tell AppSheet to change the Item separator from “,” to “#”, but only for the 1513.00.000 Numbers.
After that I use INDEX() to add one by one and to truncate each Number to 12 digits.
I start with INDEX(…,2) because the first one is unwanted Text.
I go up till INDEX(…,20) because I think there will be not more than that.
Finally I delete empty entries with -LIST("").
This is the result:

1513.56.338, 1513.56.350, 1513.56.351, 1513.51.310, 1513.56.358, 1513.56.344, 1513.56.360, 1513.56.424, 1513.56.426, 1513.56.425

And this is my expression:

LIST(
LEFT(
INDEX(
SPLIT(
SUBSTITUTE([OCRTEXT],", 1513.","# 1513.")
,"#")
,2)
,12)
)
+
LIST(
LEFT(
INDEX(
SPLIT(
SUBSTITUTE([OCRTEXT],", 1513.","# 1513.")
,"#")
,3)
,12)
)
+
LIST(
LEFT(
INDEX(
SPLIT(
SUBSTITUTE([OCRTEXT],", 1513.","# 1513.")
,"#")
,4)
,12)
)
+
LIST(
LEFT(
INDEX(
SPLIT(
SUBSTITUTE([OCRTEXT],", 1513.","# 1513.")
,"#")
,5)
,12)
)
+
LIST(
LEFT(
INDEX(
SPLIT(
SUBSTITUTE([OCRTEXT],", 1513.","# 1513.")
,"#")
,6)
,12)
)
+
LIST(
LEFT(
INDEX(
SPLIT(
SUBSTITUTE([OCRTEXT],", 1513.","# 1513.")
,"#")
,7)
,12)
)
+
LIST(
LEFT(
INDEX(
SPLIT(
SUBSTITUTE([OCRTEXT],", 1513.","# 1513.")
,"#")
,8)
,12)
)
+
LIST(
LEFT(
INDEX(
SPLIT(
SUBSTITUTE([OCRTEXT],", 1513.","# 1513.")
,"#")
,9)
,12)
)
+
LIST(
LEFT(
INDEX(
SPLIT(
SUBSTITUTE([OCRTEXT],", 1513.","# 1513.")
,"#")
,10)
,12)
)
+
LIST(
LEFT(
INDEX(
SPLIT(
SUBSTITUTE([OCRTEXT],", 1513.","# 1513.")
,"#")
,11)
,12)
)
+
LIST(
LEFT(
INDEX(
SPLIT(
SUBSTITUTE([OCRTEXT],", 1513.","# 1513.")
,"#")
,12)
,12)
)
+
LIST(
LEFT(
INDEX(
SPLIT(
SUBSTITUTE([OCRTEXT],", 1513.","# 1513.")
,"#")
,13)
,12)
)
+
LIST(
LEFT(
INDEX(
SPLIT(
SUBSTITUTE([OCRTEXT],", 1513.","# 1513.")
,"#")
,14)
,12)
)
+
LIST(
LEFT(
INDEX(
SPLIT(
SUBSTITUTE([OCRTEXT],", 1513.","# 1513.")
,"#")
,15)
,12)
)
+
LIST(
LEFT(
INDEX(
SPLIT(
SUBSTITUTE([OCRTEXT],", 1513.","# 1513.")
,"#")
,16)
,12)
)
+
LIST(
LEFT(
INDEX(
SPLIT(
SUBSTITUTE([OCRTEXT],", 1513.","# 1513.")
,"#")
,17)
,12)
)
+
LIST(
LEFT(
INDEX(
SPLIT(
SUBSTITUTE([OCRTEXT],", 1513.","# 1513.")
,"#")
,18)
,12)
)
+
LIST(
LEFT(
INDEX(
SPLIT(
SUBSTITUTE([OCRTEXT],", 1513.","# 1513.")
,"#")
,19)
,12)
)
+
LIST(
LEFT(
INDEX(
SPLIT(
SUBSTITUTE([OCRTEXT],", 1513.","# 1513.")
,"#")
,20)
,12)
)
-
LIST("")

In situations like this one, you normally can sort the list of text which would place all your similar and desired text together. You would then find starting and/or ending locations and perform some simple match based on the total text length to extract your information from the sorted list.

For example, in your case you could NORMALLY apply a function like this:

MID(
         SORT( SPLIT([OCR Text], ",")),
         FIND("1513", SORT( SPLIT([OCR Text], ","))),
         LEN([OCR Text]) 
                  -  FIND("1513", SORT( SPLIT([OCR Text], ","))) 
                  - FIND("1513", SORT( SPLIT([OCR Text], ","), true))
                  + 2 
)

This would get your text and then you can dice it up into lists, etc.


HOWEVER, I am finding an issue with sorting that prevents successfully using the above expression.

For example, this text

 **`SomeText, 123.124, 1513.56.338, 6BT6613, SA 415+416'**

When using SPLIT() and then sorting ascending, the list order is:

 `6BT6613 , 123.124 , 1513.56.338 , SA415+416 , SomeText`

But when SPLIT() and sorted descending the list order is:

 `SomeText , SA415+416 , 6BT6613 , 1513.56.338 , 123.124`

Note the placement of the value “6BT6613”

I know that when sorting non-homogenous data some unexpected things can happen. But in this case I can’t think of any reason why the sorting would be like this!!

Thank you John,
your solution looks great.
Sadly it can’t use it in my case, because the order of the numbers has to stay the same. So I can’t use any kind of SORT().

Top Labels in this Space