Finding multiple substrings from a DataFrame column dynamically?

Tejas1987 — Tue, 12 Jul 2022 20:46:46 GMT

Hello friends,

I have a DataFrame with specific values. I am trying to find specific values out of it.

*I/P -

|ID | text |

|:--|:------|

|1 | select distinct Col1 as OrderID from Table1 WHERE ( (Col3 Like '%ABC%') OR (Col3 Like '%DEF%') OR (Col3 Like '%EFG%'))|

|2 | select distinct Col1 as OrderID from Table1 WHERE (Col2 = 1234) AND( (Col3 Like '%XYZ%') OR (Col3 Like '%PQR%'))|

|3 | select distinct Col1 as OrderID from Table1 WHERE ( (Col3 Like '%MNO%') OR (Col3 Like '%PQR%'))|

|4 | select distinct Col1 as OrderID, 1 as Duration, 1 as Load from Table1 WHERE (Col3 Like '%PQR%')|

|5 | select distinct Col1 as OrderID from Table1 WHERE ( (Col4 = 'AA') OR (Col4 = 'BB') OR (Col4 = 'CC') OR (Col4 = 'DD')) AND (Col3 Like '%XYZ%')|

|6 | select distinct Col1 as OrderID from Table1 WHERE (Col1 = 1234)|

**O/P -

|ID | text_codes |

|:--|:------------|

|1 | ['ABC', 'DEF', 'EFG'] |

|2 | ['XYZ', 'PQR'] |

|3 | ['MNO', 'PQR'] |

|4 | ['PQR'] |

|5 | ['XYZ'] |

|6 | [] |

steps I have tried -

I got indexes of each code based on "Col3" keyword plus specific number of words, so intermediate output is

* O/P -

|ID | codes_idx |

|:--|:----------|

|1 | 71;94;117 |

|2 | 88;111 |

|3 | 71;94 |

|4 | 95 |

|5 | 141 |

|6 | |

Code for it -

    def toString(x):
 
    	l = list(x)
 
    	ll = [str(i) for i in l]
 
    	return ';'.join(ll)
 
     
 
    def getCodes(str1):
 
    	findStr = '(Col3 Like '
 
    	res = [i+14 for i in range(len(str1)) if str1.startswith(findStr, i) ]
 
    	return toString(res)
 
     
 
    result = df2_query1.rdd.map(lambda x: (x[0], getCodes(x[0]) ) )
 
    result.collect()
 
    df_result = spark.createDataFrame(result)

But when I tried to get substring function applied here.

    res_list = []
 
     
 
    for i in range(len(str1)):
 
    	idxi = str1.startswith(findStr, i)
 
    	res_str = substring(str1,idxi+14,3)
 
    	res_list.append(res_str)

I get Attribute Error -

    org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 64.0 failed 4 times, most recent failure: Lost task 0.3 in stage 64.0 (TID 118) (10.139.64.5 executor 1): org.apache.spark.api.python.PythonException: &#39;AttributeError: &#39;NoneType&#39; object has no attribute &#39;_jvm&#39;&#39;, from &lt;command-1021210642233982&gt;, line 9. Full traceback below:

I am not sure what I am doing wrong here. Any help is appreciated.

Thanks,

Tejas

Re: Finding multiple substrings from a DataFrame column dynamically?

AmanSehgal — Thu, 14 Jul 2022 10:30:40 GMT

What is the logic for substring function?

Can't you use str1[idxi+14:3] for substring?

Re: Finding multiple substrings from a DataFrame column dynamically?

Tejas1987 — Mon, 18 Jul 2022 02:16:27 GMT

Hello Melbourne,

It worked with the expression you suggested. I had to just make small changes to final solution, but it is giving me the output I wanted.

I used -

str1[str(idxi)+14:str(idxi)+17] as my expression to get specific values in the function and applied map to get necessary output.

Thank you.

Regards,

Tejas Parlikar

topic Re: Finding multiple substrings from a DataFrame column dynamically? in Data Engineering

Finding multiple substrings from a DataFrame column dynamically?

Re: Finding multiple substrings from a DataFrame column dynamically?

Re: Finding multiple substrings from a DataFrame column dynamically?