<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Finding multiple substrings from a DataFrame column dynamically? in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/finding-multiple-substrings-from-a-dataframe-column-dynamically/m-p/13759#M8383</link>
    <description>&lt;P&gt;Hello Melbourne,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;It worked with the expression you suggested. I had to just make small changes to final solution, but it is giving me the output I wanted.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I used - &lt;/P&gt;&lt;P&gt;str1[str(idxi)+14:str(idxi)+17] as my expression to get specific values in the function and applied map to get necessary output.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thank you.&lt;/P&gt;&lt;P&gt;Regards,&lt;/P&gt;&lt;P&gt;Tejas Parlikar&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
    <pubDate>Mon, 18 Jul 2022 02:16:27 GMT</pubDate>
    <dc:creator>Tejas1987</dc:creator>
    <dc:date>2022-07-18T02:16:27Z</dc:date>
    <item>
      <title>Finding multiple substrings from a DataFrame column dynamically?</title>
      <link>https://community.databricks.com/t5/data-engineering/finding-multiple-substrings-from-a-dataframe-column-dynamically/m-p/13757#M8381</link>
      <description>&lt;P&gt;Hello friends,&lt;/P&gt;&lt;P&gt;I have a DataFrame with specific values. I am trying to find specific values out of it.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;/P&gt;&lt;P&gt;*I/P -&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;|ID | text |&lt;/P&gt;&lt;P&gt;|:--|:------|&lt;/P&gt;&lt;P&gt;|1&amp;nbsp;| select distinct Col1 as OrderID from Table1 WHERE ( (Col3 Like '%ABC%') OR (Col3 Like '%DEF%') OR (Col3 Like '%EFG%'))|&lt;/P&gt;&lt;P&gt;|2&amp;nbsp;| select distinct Col1 as OrderID from Table1 WHERE (Col2 = 1234) AND( (Col3 Like '%XYZ%') OR (Col3 Like '%PQR%'))|&lt;/P&gt;&lt;P&gt;|3&amp;nbsp;| select distinct Col1 as OrderID from Table1 WHERE ( (Col3 Like '%MNO%') OR (Col3 Like '%PQR%'))|&lt;/P&gt;&lt;P&gt;|4&amp;nbsp;| select distinct Col1 as OrderID, 1 as Duration, 1 as Load from Table1 WHERE (Col3 Like '%PQR%')|&lt;/P&gt;&lt;P&gt;|5&amp;nbsp;| select distinct Col1 as OrderID from Table1 WHERE ( (Col4 = 'AA') OR (Col4 = 'BB') OR (Col4 = 'CC') OR (Col4 = 'DD')) AND (Col3 Like '%XYZ%')|&lt;/P&gt;&lt;P&gt;|6&amp;nbsp;| select distinct Col1 as OrderID from Table1 WHERE (Col1 = 1234)|&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;**O/P -&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;|ID | text_codes |&lt;/P&gt;&lt;P&gt;|:--|:------------|&lt;/P&gt;&lt;P&gt;|1&amp;nbsp;| ['ABC', 'DEF', 'EFG'] |&lt;/P&gt;&lt;P&gt;|2&amp;nbsp;| ['XYZ', 'PQR'] |&lt;/P&gt;&lt;P&gt;|3&amp;nbsp;| ['MNO', 'PQR'] |&lt;/P&gt;&lt;P&gt;|4&amp;nbsp;| ['PQR'] |&lt;/P&gt;&lt;P&gt;|5&amp;nbsp;| ['XYZ'] |&lt;/P&gt;&lt;P&gt;|6&amp;nbsp;| [] |&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;steps I have tried -&amp;nbsp;&lt;/P&gt;&lt;P&gt;I got indexes of each code based on "Col3" keyword plus specific number of words, so intermediate output is&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;* O/P -&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;|ID | codes_idx |&lt;/P&gt;&lt;P&gt;|:--|:----------|&lt;/P&gt;&lt;P&gt;|1&amp;nbsp;| 71;94;117 |&lt;/P&gt;&lt;P&gt;|2&amp;nbsp;| 88;111 |&lt;/P&gt;&lt;P&gt;|3&amp;nbsp;| 71;94 |&lt;/P&gt;&lt;P&gt;|4&amp;nbsp;| 95 |&lt;/P&gt;&lt;P&gt;|5&amp;nbsp;| 141 |&lt;/P&gt;&lt;P&gt;|6&amp;nbsp;|&amp;nbsp;|&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Code for it -&amp;nbsp;&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;    def toString(x):
&amp;nbsp;
    	l = list(x)
&amp;nbsp;
    	ll = [str(i) for i in l]
&amp;nbsp;
    	return ';'.join(ll)
&amp;nbsp;
     
&amp;nbsp;
    def getCodes(str1):
&amp;nbsp;
    	findStr = '(Col3 Like '
&amp;nbsp;
    	res = [i+14 for i in range(len(str1)) if str1.startswith(findStr, i) ]
&amp;nbsp;
    	return toString(res)
&amp;nbsp;
     
&amp;nbsp;
    result = df2_query1.rdd.map(lambda x: (x[0], getCodes(x[0]) ) )
&amp;nbsp;
    result.collect()
&amp;nbsp;
    df_result = spark.createDataFrame(result)
&amp;nbsp;&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;But when I tried to get substring function applied here.&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;    res_list = []
&amp;nbsp;
     
&amp;nbsp;
    for i in range(len(str1)):
&amp;nbsp;
    	idxi = str1.startswith(findStr, i)
&amp;nbsp;
    	res_str = substring(str1,idxi+14,3)
&amp;nbsp;
    	res_list.append(res_str)
&amp;nbsp;&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;I get Attribute Error -&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;    org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 64.0 failed 4 times, most recent failure: Lost task 0.3 in stage 64.0 (TID 118) (10.139.64.5 executor 1): org.apache.spark.api.python.PythonException: &amp;amp;#39;AttributeError: &amp;amp;#39;NoneType&amp;amp;#39; object has no attribute &amp;amp;#39;_jvm&amp;amp;#39;&amp;amp;#39;, from &amp;amp;lt;command-1021210642233982&amp;amp;gt;, line 9. Full traceback below:&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I am not sure what I am doing wrong here. Any help is appreciated.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thanks,&lt;/P&gt;&lt;P&gt;Tejas&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 12 Jul 2022 20:46:46 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/finding-multiple-substrings-from-a-dataframe-column-dynamically/m-p/13757#M8381</guid>
      <dc:creator>Tejas1987</dc:creator>
      <dc:date>2022-07-12T20:46:46Z</dc:date>
    </item>
    <item>
      <title>Re: Finding multiple substrings from a DataFrame column dynamically?</title>
      <link>https://community.databricks.com/t5/data-engineering/finding-multiple-substrings-from-a-dataframe-column-dynamically/m-p/13758#M8382</link>
      <description>&lt;P&gt;What is the logic for substring function?&lt;/P&gt;&lt;P&gt;Can't you use str1[idxi+14:3] for substring?&lt;/P&gt;</description>
      <pubDate>Thu, 14 Jul 2022 10:30:40 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/finding-multiple-substrings-from-a-dataframe-column-dynamically/m-p/13758#M8382</guid>
      <dc:creator>AmanSehgal</dc:creator>
      <dc:date>2022-07-14T10:30:40Z</dc:date>
    </item>
    <item>
      <title>Re: Finding multiple substrings from a DataFrame column dynamically?</title>
      <link>https://community.databricks.com/t5/data-engineering/finding-multiple-substrings-from-a-dataframe-column-dynamically/m-p/13759#M8383</link>
      <description>&lt;P&gt;Hello Melbourne,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;It worked with the expression you suggested. I had to just make small changes to final solution, but it is giving me the output I wanted.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I used - &lt;/P&gt;&lt;P&gt;str1[str(idxi)+14:str(idxi)+17] as my expression to get specific values in the function and applied map to get necessary output.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thank you.&lt;/P&gt;&lt;P&gt;Regards,&lt;/P&gt;&lt;P&gt;Tejas Parlikar&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 18 Jul 2022 02:16:27 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/finding-multiple-substrings-from-a-dataframe-column-dynamically/m-p/13759#M8383</guid>
      <dc:creator>Tejas1987</dc:creator>
      <dc:date>2022-07-18T02:16:27Z</dc:date>
    </item>
  </channel>
</rss>

