Python Pandas - String and Regular Expression

17 minute read

Exercise - Pandas String and Regular Expression

import pandas as pd
import numpy as np
import re
  1. Write a Pandas program to convert all the string values to upper, lower cases in a given pandas series. Also find the length of the string values.
s1 = pd.Series(['X', 'Y', 'Z', 'Aaba', 'Baca', np.nan, 'CABA', None, 'bird', 'horse', 'dog'])
s1.str.upper()
0         X
1         Y
2         Z
3      AABA
4      BACA
5       NaN
6      CABA
7      None
8      BIRD
9     HORSE
10      DOG
dtype: object
s1.str.lower()
0         x
1         y
2         z
3      aaba
4      baca
5       NaN
6      caba
7      None
8      bird
9     horse
10      dog
dtype: object
s1.str.len()
0     1.0
1     1.0
2     1.0
3     4.0
4     4.0
5     NaN
6     4.0
7     NaN
8     4.0
9     5.0
10    3.0
dtype: float64
  1. Write a Pandas program to remove whitespaces, left sided whitespaces and right sided whitespaces of the string values of a given pandas series.
s2 = pd.Index([' Green', 'Black ', ' Red ', 'White', ' Pink '])
s2.str.strip()
Index(['Green', 'Black', 'Red', 'White', 'Pink'], dtype='object')
s2.str.lstrip()
Index(['Green', 'Black ', 'Red ', 'White', 'Pink '], dtype='object')
s2.str.rstrip()
Index([' Green', 'Black', ' Red', 'White', ' Pink'], dtype='object')
  1. Write a Pandas program to add leading zeros to the integer column in a pandas series and makes the length of the field to 8 digit.
df3=pd.DataFrame({'amount': [10, 250, 3000, 40000, 500000]})
#Solution 1
df3['amount'].apply(lambda x:'{0:0>8}'.format(x))
0    00000010
1    00000250
2    00003000
3    00040000
4    00500000
Name: amount, dtype: object
#Solution 2
df3['amount'].apply(lambda x:str(x).zfill(8))
0    00000010
1    00000250
2    00003000
3    00040000
4    00500000
Name: amount, dtype: object
  1. Write a Pandas program to add leading zeros to the character column in a pandas series and makes the length of the field to 8 digit.
df4 = pd.DataFrame({'amount': ['10', '250', '3000', '40000', '500000']})
#solution 1
df3.amount.apply(lambda x:'{0:0>8}'.format(x))
0    00000010
1    00000250
2    00003000
3    00040000
4    00500000
Name: amount, dtype: object
#Solution 2
df4.amount.apply(lambda x:x.zfill(8))
0    00000010
1    00000250
2    00003000
3    00040000
4    00500000
Name: amount, dtype: object
  1. Write a Pandas program to capitalize all the string values of specified columns of a given DataFrame.
df5 = pd.DataFrame({
    'name': ['alberto','gino','ryan', 'Eesha', 'syed'],
    'date_of_birth ': ['17/05/2002','16/02/1999','25/09/1998','11/05/2002','15/09/1997'],
    'age': [18.5, 21.2, 22.5, 22, 23]
})
df5
name date_of_birth age
0 alberto 17/05/2002 18.5
1 gino 16/02/1999 21.2
2 ryan 25/09/1998 22.5
3 Eesha 11/05/2002 22.0
4 syed 15/09/1997 23.0
df5.name.str.capitalize()
0    Alberto
1       Gino
2       Ryan
3      Eesha
4       Syed
Name: name, dtype: object
  1. Write a Pandas program to count of occurrence of a specified substring in a DataFrame column.
df6 = pd.DataFrame({
    'name_code': ['c001','c002','c022', 'c2002', 'c2222'],
    'date_of_birth ': ['12/05/2002','16/02/1999','25/09/1998','12/02/2022','15/09/1997'],
    'age': [18.5, 21.2, 22.5, 22, 23]
})
df6
name_code date_of_birth age
0 c001 12/05/2002 18.5
1 c002 16/02/1999 21.2
2 c022 25/09/1998 22.5
3 c2002 12/02/2022 22.0
4 c2222 15/09/1997 23.0
print("\nCount occurrence of 2 in name_code column:")

df6['count']=df6.name_code.apply(lambda x:x.count('2'))
df6
Count occurrence of 2 in name_code column:
name_code date_of_birth age count
0 c001 12/05/2002 18.5 0
1 c002 16/02/1999 21.2 1
2 c022 25/09/1998 22.5 2
3 c2002 12/02/2022 22.0 2
4 c2222 15/09/1997 23.0 4
  1. Write a Pandas program to find the index of a given substring of a DataFrame column.
df7 = pd.DataFrame({
    'name_code': ['c001','c002','c022', 'c2002', 'c2222'],
    'date_of_birth ': ['12/05/2002','16/02/1999','25/09/1998','12/02/2022','15/09/1997'],
    'age': [18.5, 21.2, 22.5, 22, 23]
})
print("\nCount occurrence of 22 in name_code column:")
df7['Occur_22']=df7.name_code.apply(lambda x: 'Not Available' if x.find('22')==-1 else x.find('22'))
#df7['Occur_22'] = list(map(lambda x: x.find('22'), df7['name_code']))
df7
Count occurrence of 22 in name_code column:
name_code date_of_birth age Occur_22
0 c001 12/05/2002 18.5 Not Available
1 c002 16/02/1999 21.2 Not Available
2 c022 25/09/1998 22.5 2
3 c2002 12/02/2022 22.0 Not Available
4 c2222 15/09/1997 23.0 1
  1. Write a Pandas program to find the index of a substring of DataFrame within provided index boundary (beginning and end position).
df8 = pd.DataFrame({
    'name_code': ['c0001','1000c','b00c2', 'b2c02', 'c2222'],
    'date_of_birth ': ['12/05/2002','16/02/1999','25/09/1998','12/02/2022','15/09/1997'],
    'age': [18.5, 21.2, 22.5, 22, 23]
})
df8['occur_c_within_bound']=df8.name_code.apply(lambda x:x.find('c',0,4)) # 0 is staring index, 4 is end position
df8
name_code date_of_birth age occur_c_within_bound
0 c0001 12/05/2002 18.5 0
1 1000c 16/02/1999 21.2 -1
2 b00c2 25/09/1998 22.5 3
3 b2c02 12/02/2022 22.0 2
4 c2222 15/09/1997 23.0 0
  1. Write a Pandas program to check whether alpha numeric values present in a given column of a DataFrame.
df9 = pd.DataFrame({
    'name_code': ['Company','Company a001','Company 123', '1234', 'Company12'],
    'date_of_birth ': ['12/05/2002','16/02/1999','25/09/1998','12/02/2022','15/09/1997'],
    'age': [18.5, 21.2, 22.5, 22, 23]
})
df9['AphaNum_Flag']=df9.name_code.apply(lambda x:x.isalnum())
df9
name_code date_of_birth age AphaNum_Flag
0 Company 12/05/2002 18.5 True
1 Company a001 16/02/1999 21.2 False
2 Company 123 25/09/1998 22.5 False
3 1234 12/02/2022 22.0 True
4 Company12 15/09/1997 23.0 True
  1. Write a Pandas program to check whether only alphabetic values present in a given column of a DataFrame.
df10 = pd.DataFrame({
    'company_code': ['Company','Company a001','Company 123', 'abcd', 'Company 12'],
    'date_of_sale ': ['12/05/2002','16/02/1999','25/09/1998','12/02/2022','15/09/1997'],
    'sale_amount': [12348.5, 233331.2, 22.5, 2566552.0, 23.0]})
df10['Alphabetic_flag']=df10.company_code.apply(lambda x: x.isalpha())
df10
company_code date_of_sale sale_amount Alphabetic_flag
0 Company 12/05/2002 12348.5 True
1 Company a001 16/02/1999 233331.2 False
2 Company 123 25/09/1998 22.5 False
3 abcd 12/02/2022 2566552.0 True
4 Company 12 15/09/1997 23.0 False
  1. Write a Pandas program to check whether only numeric values present in a given column of a DataFrame
df11 = pd.DataFrame({
    'company_code': ['Company','Company a001', '2055', 'abcd', '123345'],
    'date_of_sale ': ['12/05/2002','16/02/1999','25/09/1998','12/02/2022','15/09/1997'],
    'sale_amount': [12348.5, 233331.2, 22.5, 2566552.0, 23.0]})
df11['Numeric_flag']=df11.company_code.apply(lambda x:x.isdigit())
df11
company_code date_of_sale sale_amount Numeric_flag
0 Company 12/05/2002 12348.5 False
1 Company a001 16/02/1999 233331.2 False
2 2055 25/09/1998 22.5 True
3 abcd 12/02/2022 2566552.0 False
4 123345 15/09/1997 23.0 True
  1. Write a Pandas program to check whether only lower case or upper case is present in a given column of a DataFrame.
df12 = pd.DataFrame({
    'company_code': ['ABCD','EFGF', 'hhhh', 'abcd', 'EAWQaaa'],
    'date_of_sale ': ['12/05/2002','16/02/1999','25/09/1998','12/02/2022','15/09/1997'],
    'sale_amount': [12348.5, 233331.2, 22.5, 2566552.0, 23.0]})

df12['islower']=df12.company_code.apply(lambda x:(x.islower()))
df12['isupper']=df12.company_code.apply(lambda x:(x.isupper()))
df12
company_code date_of_sale sale_amount islower isupper
0 ABCD 12/05/2002 12348.5 False True
1 EFGF 16/02/1999 233331.2 False True
2 hhhh 25/09/1998 22.5 True False
3 abcd 12/02/2022 2566552.0 True False
4 EAWQaaa 15/09/1997 23.0 False False
  1. Write a Pandas program to check whether only proper case or title case is present in a given column of a DataFrame.
df13 = pd.DataFrame({
    'company_code': ['ABCD','EFGF', 'Hhhh', 'abcd', 'EAWQaaa'],
    'date_of_sale ': ['12/05/2002','16/02/1999','25/09/1998','12/02/2022','15/09/1997'],
    'sale_amount': [12348.5, 233331.2, 22.5, 2566552.0, 23.0]})

df13['istitle']=df13.company_code.apply(lambda x:x.istitle())
df13
company_code date_of_sale sale_amount istitle
0 ABCD 12/05/2002 12348.5 False
1 EFGF 16/02/1999 233331.2 False
2 Hhhh 25/09/1998 22.5 True
3 abcd 12/02/2022 2566552.0 False
4 EAWQaaa 15/09/1997 23.0 False
  1. Write a Pandas program to check whether only space is present in a given column of a DataFrame
df14 = pd.DataFrame({
    'company_code': ['Abcd','EFGF ', '  ', 'abcd', ' '],
    'date_of_sale ': ['12/05/2002','16/02/1999','25/09/1998','12/02/2022','15/09/1997'],
    'sale_amount': [12348.5, 233331.2, 22.5, 2566552.0, 23.0]})
df14['isspace']=df14.company_code.apply(lambda x:x.isspace())
#Alternative
#df14['isspace']=df14.company_code.str.isspace()
df14
company_code date_of_sale sale_amount isspace
0 Abcd 12/05/2002 12348.5 False
1 EFGF 16/02/1999 233331.2 False
2 25/09/1998 22.5 True
3 abcd 12/02/2022 2566552.0 False
4 15/09/1997 23.0 True
  1. Write a Pandas program to get the length of the string present of a given column in a DataFrame.
df15 = pd.DataFrame({
    'company_code': ['Abcd','EFGF', 'skfsalf', 'sdfslew', 'safsdf'],
    'date_of_sale ': ['12/05/2002','16/02/1999','25/09/1998','12/02/2022','15/09/1997'],
    'sale_amount': [12348.5, 233331.2, 22.5, 2566552.0, 23.0]})
df15['length_company']=df15.company_code.str.len()
df15
company_code date_of_sale sale_amount length_company
0 Abcd 12/05/2002 12348.5 4
1 EFGF 16/02/1999 233331.2 4
2 skfsalf 25/09/1998 22.5 7
3 sdfslew 12/02/2022 2566552.0 7
4 safsdf 15/09/1997 23.0 6
  1. Write a Pandas program to get the length of the integer of a given column in a DataFrame.
df16 = pd.DataFrame({
    'company_code': ['Abcd','EFGF', 'skfsalf', 'sdfslew', 'safsdf'],
    'date_of_sale': ['12/05/2002','16/02/1999','25/09/1998','12/02/2022','15/09/1997'],
    'sale_amount': [12348.5, 233331.2, 22.5, 2566552.0, 23.0]
})
df16['sale_amnt_len']=df16.sale_amount.astype('str').str.len()
df16
company_code date_of_sale sale_amount sale_amnt_len
0 Abcd 12/05/2002 12348.5 7
1 EFGF 16/02/1999 233331.2 8
2 skfsalf 25/09/1998 22.5 4
3 sdfslew 12/02/2022 2566552.0 9
4 safsdf 15/09/1997 23.0 4
  1. Write a Pandas program to check if a specified column starts with a specified string in a DataFrame.
df17 = pd.DataFrame({
    'company_code': ['Abcd','EFGF', 'zefsalf', 'sdfslew', 'zekfsdf'],
    'date_of_sale': ['12/05/2002','16/02/1999','25/09/1998','12/02/2022','15/09/1997'],
    'sale_amount': [12348.5, 233331.2, 22.5, 2566552.0, 23.0]
})
df17['company_code_starts_with']=df17.company_code.str.startswith('ze')
df17
company_code date_of_sale sale_amount company_code_starts_with
0 Abcd 12/05/2002 12348.5 False
1 EFGF 16/02/1999 233331.2 False
2 zefsalf 25/09/1998 22.5 True
3 sdfslew 12/02/2022 2566552.0 False
4 zekfsdf 15/09/1997 23.0 True
  1. Write a Pandas program to swap the cases of a specified character column in a given DataFrame.
df18 = pd.DataFrame({
    'company_code': ['Abcd','EFGF', 'zefsalf', 'sdfslew', 'zekfsdf'],
    'date_of_sale': ['12/05/2002','16/02/1999','25/09/1998','12/02/2022','15/09/1997'],
    'sale_amount': [12348.5, 233331.2, 22.5, 2566552.0, 23.0]
})
df18['swapcase_cc']=df18.company_code.str.swapcase()
df18
company_code date_of_sale sale_amount swapcase_cc
0 Abcd 12/05/2002 12348.5 aBCD
1 EFGF 16/02/1999 233331.2 efgf
2 zefsalf 25/09/1998 22.5 ZEFSALF
3 sdfslew 12/02/2022 2566552.0 SDFSLEW
4 zekfsdf 15/09/1997 23.0 ZEKFSDF
  1. Write a Pandas program to convert a specified character column in upper/lower cases in a given DataFrame.
df19 = pd.DataFrame({
    'company_code': ['Abcd','EFGF', 'zefsalf', 'sdfslew', 'zekfsdf'],
    'date_of_sale': ['12/05/2002','16/02/1999','25/09/1998','12/02/2022','15/09/1997'],
    'sale_amount': [12348.5, 233331.2, 22.5, 2566552.0, 23.0]
})
df19['upper_cc']=df19.company_code.str.upper()
df19['lower_cc']=df19.company_code.str.lower()
df19
company_code date_of_sale sale_amount upper_cc lower_cc
0 Abcd 12/05/2002 12348.5 ABCD abcd
1 EFGF 16/02/1999 233331.2 EFGF efgf
2 zefsalf 25/09/1998 22.5 ZEFSALF zefsalf
3 sdfslew 12/02/2022 2566552.0 SDFSLEW sdfslew
4 zekfsdf 15/09/1997 23.0 ZEKFSDF zekfsdf
  1. Write a Pandas program to convert a specified character column in title case in a given DataFrame.
df20 = pd.DataFrame({
    'company_code': ['Abcd','EFGF', 'zefsalf', 'sdfslew', 'zekfsdf'],
    'date_of_sale': ['12/05/2002','16/02/1999','25/09/1998','12/02/2022','15/09/1997'],
    'sale_amount': [12348.5, 233331.2, 22.5, 2566552.0, 23.0]
})
df20['title_cc']=df20.company_code.str.title()
df20
company_code date_of_sale sale_amount title_cc
0 Abcd 12/05/2002 12348.5 Abcd
1 EFGF 16/02/1999 233331.2 Efgf
2 zefsalf 25/09/1998 22.5 Zefsalf
3 sdfslew 12/02/2022 2566552.0 Sdfslew
4 zekfsdf 15/09/1997 23.0 Zekfsdf
  1. Write a Pandas program to replace arbitrary values with other values in a given DataFrame.
df21 = pd.DataFrame({
    'company_code': ['A','B', 'C', 'D', 'A'],
    'date_of_sale': ['12/05/2002','16/02/1999','25/09/1998','12/02/2022','15/09/1997'],
    'sale_amount': [12348.5, 233331.2, 22.5, 2566552.0, 23.0]
})
#replacing A with C
df21.replace('A','C')
company_code date_of_sale sale_amount
0 C 12/05/2002 12348.5
1 B 16/02/1999 233331.2
2 C 25/09/1998 22.5
3 D 12/02/2022 2566552.0
4 C 15/09/1997 23.0
  1. Write a Pandas program to replace more than one value with other values in a given DataFrame.
df22 = pd.DataFrame({
    'company_code': ['A','B', 'C', 'D', 'A'],
    'date_of_sale': ['12/05/2002','16/02/1999','25/09/1998','12/02/2022','15/09/1997'],
    'sale_amount': [12348.5, 233331.2, 22.5, 2566552.0, 23.0]
})
df22.replace(['A','C'],['X','Y'])
company_code date_of_sale sale_amount
0 X 12/05/2002 12348.5
1 B 16/02/1999 233331.2
2 Y 25/09/1998 22.5
3 D 12/02/2022 2566552.0
4 X 15/09/1997 23.0
  1. Write a Pandas program to split a string of a column of a given DataFrame into multiple columns.
df23 = pd.DataFrame({
    'name': ['Alberto  Franco','Gino Ann Mcneill','Ryan  Parkes', 'Eesha Artur Hinton', 'Syed  Wharton'],
    'date_of_birth ': ['17/05/2002','16/02/1999','25/09/1998','11/05/2002','15/09/1997'],
    'age': [18.5, 21.2, 22.5, 22, 23]
})
df23['First Middle Last'.split()]=df23.name.str.split(expand=True)
df23
name date_of_birth age First Middle Last
0 Alberto Franco 17/05/2002 18.5 Alberto Franco None
1 Gino Ann Mcneill 16/02/1999 21.2 Gino Ann Mcneill
2 Ryan Parkes 25/09/1998 22.5 Ryan Parkes None
3 Eesha Artur Hinton 11/05/2002 22.0 Eesha Artur Hinton
4 Syed Wharton 15/09/1997 23.0 Syed Wharton None
  1. Write a Pandas program to extract email from a specified column of string type of a given DataFrame
df24 = pd.DataFrame({
    'name_email': ['Alberto Franco af@gmail.com','Gino Mcneill gm@yahoo.com','Ryan Parkes rp@abc.io', 'Eesha Hinton', 'Gino Mcneill gm@github.com']
    })
df24['email']=df24.name_email.apply(lambda x:re.findall('\S+@+\S+',str(x)))
df24
name_email email
0 Alberto Franco af@gmail.com [af@gmail.com]
1 Gino Mcneill gm@yahoo.com [gm@yahoo.com]
2 Ryan Parkes rp@abc.io [rp@abc.io]
3 Eesha Hinton []
4 Gino Mcneill gm@github.com [gm@github.com]

df24['email2']=df24.name_email.apply(lambda x:','.join(re.findall(r'[\w\.-]+@[\w\.-]+',str(x))))
df24
name_email email email2
0 Alberto Franco af@gmail.com [af@gmail.com] af@gmail.com
1 Gino Mcneill gm@yahoo.com [gm@yahoo.com] gm@yahoo.com
2 Ryan Parkes rp@abc.io [rp@abc.io] rp@abc.io
3 Eesha Hinton []
4 Gino Mcneill gm@github.com [gm@github.com] gm@github.com
  1. Write a Pandas program to extract hash attached word from twitter text from the specified column of a given DataFrame.
df25 = pd.DataFrame({
    'tweets': ['#Obama says goodbye','Retweets for #cash','A political endorsement in #Indonesia', '1 dog = many #retweets', 'Just a simple #egg']
    })
df25['hashtags']=df25.tweets.apply(lambda x: ' '.join(re.findall(r'(?<=#)\w+',x)))
df25
tweets hashtags
0 #Obama says goodbye Obama
1 Retweets for #cash cash
2 A political endorsement in #Indonesia Indonesia
3 1 dog = many #retweets retweets
4 Just a simple #egg egg
  1. Write a Pandas program to extract word mention someone in tweets using @ from the specified column of a given DataFrame.
df26 = pd.DataFrame({
    'tweets': ['@Obama says goodbye','Retweets for @cash','A political endorsement in @Indonesia', '1 dog = many #retweets', 'Just a simple #egg']
    })
df26['at_words']=df26.tweets.apply(lambda x:' '.join(re.findall(r'(?<=@)\w+',x)))
df26
tweets at_words
0 @Obama says goodbye Obama
1 Retweets for @cash cash
2 A political endorsement in @Indonesia Indonesia
3 1 dog = many #retweets
4 Just a simple #egg
  1. Write a Pandas program to extract only number from the specified column of a given DataFrame.
df27 = pd.DataFrame({
    'company_code': ['c0001','c0002','c0003', 'c0003', 'c0004'],
    'address': ['7277 Surrey Ave.','920 N. Bishop Ave.','9910 Golden Star St.', '25 Dunbar St.', '17 West Livingston Court']
    })
df27['numeric_in_address']=df27.address.apply(lambda x:','.join(re.findall(r'[0-9]+',x)))
df27
company_code address numeric_in_address
0 c0001 7277 Surrey Ave. 7277
1 c0002 920 N. Bishop Ave. 920
2 c0003 9910 Golden Star St. 9910
3 c0003 25 Dunbar St. 25
4 c0004 17 West Livingston Court 17
  1. Write a Pandas program to extract only phone number from the specified column of a given DataFrame.
df28 = pd.DataFrame({
    'company_code': ['c0001','c0002','c0003', 'c0003', 'c0004'],
    'company_phone_no': ['Company1-Phone no. 4695168357','Company2-Phone no. 8088729013','Company3-Phone no. 6204658086', 'Company4-Phone no. 5159530096', 'Company5-Phone no. 9037952371']
    })
re.findall(r'\b\d{10}\b',df28.company_phone_no[1])
['8088729013']
df28['phone_number']=df28.company_phone_no.apply(lambda x:','.join(re.findall(r'\b\d{10}\b',x)))
df28
company_code company_phone_no phone_number
0 c0001 Company1-Phone no. 4695168357 4695168357
1 c0002 Company2-Phone no. 8088729013 8088729013
2 c0003 Company3-Phone no. 6204658086 6204658086
3 c0003 Company4-Phone no. 5159530096 5159530096
4 c0004 Company5-Phone no. 9037952371 9037952371
  1. Write a Pandas program to extract year between 1800 to 2200 from the specified column of a given DataFrame.
df29 = pd.DataFrame({
    'company_code': ['c0001','c0002','c0003', 'c0003', 'c0004'],
    'year': ['year 1800','year 1700','year 2300', 'year 1900', 'year 2200']
    })
df29.year.str.split(expand=True)
0 1
0 year 1800
1 year 1700
2 year 2300
3 year 1900
4 year 2200
re.findall(r'\b([1][8-9]+\d{02}|[2][1]\d{2}|2200)\b','year 1800 year 1700 year 2300 year 1900 year 2200 year 1711')
['1800', '1900', '2200']
df29['year_btw']=df29.year.apply(lambda x:','.join(re.findall(r'\b([1][8-9]+\d{02}|[2][1]\d{2}|2200)\b',x)))
df29
company_code year year_btw
0 c0001 year 1800 1800
1 c0002 year 1700
2 c0003 year 2300
3 c0003 year 1900 1900
4 c0004 year 2200 2200
  1. Write a Pandas program to extract only non alphanumeric characters from the specified column of a given DataFrame.
df30 = pd.DataFrame({
    'company_code': ['c0001#','c00@0^2','$c0003', 'c0003', '&c0004'],
    'year': ['year 1800','year 1700','year 2300', 'year 1900', 'year 2200']
    })
re.findall(r'[^A-Za-z0-9]','c00@0^2')
['@', '^']
df30['special_char']=df30.company_code.apply(lambda x:','.join(re.findall(r'[^a-z0-9]',x)))
df30
company_code year special_char
0 c0001# year 1800 #
1 c00@0^2 year 1700 @,^
2 $c0003 year 2300 $
3 c0003 year 1900
4 &c0004 year 2200 &
  1. Write a Pandas program to extract only punctuations from the specified column of a given DataFrame.
df31 = pd.DataFrame({
    'company_code': ['c0001.','c000,2','c0003', 'c0003#', 'c0004,'],
    'year': ['year 1800','year 1700','year 2300', 'year 1900', 'year 2200']
    })
df31['puct']=df31.company_code.apply(lambda x:re.findall(r'[^\w\s]',x))
df31
company_code year puct
0 c0001. year 1800 [.]
1 c000,2 year 1700 [,]
2 c0003 year 2300 []
3 c0003# year 1900 [#]
4 c0004, year 2200 [,]
  1. Write a Pandas program to remove repetitive characters from the specified column of a given DataFrame.
df32 = pd.DataFrame({
    'text_code': ['t0001.','t0002','t0003', 't0004'],
    'text_lang': ['She livedd a long life.', 'How oold is your father?', 'What is tthe problem?','TThhis desk is used by Tom.']
    })

def rep_char(str1):
    tchr = str1.group(0)
    if len(tchr) > 1:
        return tchr[0:1] # can change the value here on repetition
def unique_char(rep, sent_text):
    convert = re.sub(r'(\w)\1+', rep, sent_text)
    return convert
df32['normal_text']=df32['text_lang'].apply(lambda x : unique_char(rep_char,x))
df32
text_code text_lang normal_text
0 t0001. She livedd a long life. She lived a long life.
1 t0002 How oold is your father? How old is your father?
2 t0003 What is tthe problem? What is the problem?
3 t0004 TThhis desk is used by Tom. This desk is used by Tom.
  1. Write a Pandas program to extract numbers greater than 940 from the specified column of a given DataFrame.
df33 = pd.DataFrame({
    'company_code': ['c0001','c0002','c0003', 'c0003', 'c0004'],
    'address': ['7277 Surrey Ave.1111','920 N. Bishop Ave.','9910 Golden Star St.', '1025 Dunbar St.', '1700 West Livingston Court']
    })
df33
company_code address
0 c0001 7277 Surrey Ave.1111
1 c0002 920 N. Bishop Ave.
2 c0003 9910 Golden Star St.
3 c0003 1025 Dunbar St.
4 c0004 1700 West Livingston Court
df33['num_great']=df33.address.apply(lambda x:','.join(re.findall(r'([9][4-9]\d|[1-9]\d{3})',x)))
df33
company_code address num_great
0 c0001 7277 Surrey Ave.1111 7277,1111
1 c0002 920 N. Bishop Ave.
2 c0003 9910 Golden Star St. 991
3 c0003 1025 Dunbar St. 1025
4 c0004 1700 West Livingston Court 1700
  1. Write a Pandas program to extract numbers less than 100 from the specified column of a given DataFrame.
df34 = pd.DataFrame({
    'company_code': ['c0001','c0002','c0003', 'c0003', 'c0004'],
    'address': ['72 Surrey Ave.11','92 N. Bishop Ave.','9910 Golden Star St.', '102 Dunbar St.', '17 West Livingston Court']
    })
df34['num_less']=df34.address.apply(lambda x:re.findall(r'(\b\d{2}\b)',x))
df34
company_code address num_less
0 c0001 72 Surrey Ave.11 [72, 11]
1 c0002 92 N. Bishop Ave. [92]
2 c0003 9910 Golden Star St. []
3 c0003 102 Dunbar St. []
4 c0004 17 West Livingston Court [17]
  1. Write a Pandas program to check whether two given words present in a specified column of a given DataFrame.
df35 = pd.DataFrame({
    'company_code': ['c0001','c0002','c0003', 'c0003', 'c0004'],
    'address': ['9910 Surrey Ave.','92 N. Bishop Ave.','9910 Golden Star Ave.', '102 Dunbar St.', '17 West Livingston Court']
    })
df35['check_words']=df35.address.apply(lambda x:re.findall(r'(?=.*Ave.)(?=.*9910).*',x))
df35
company_code address check_words
0 c0001 9910 Surrey Ave. [9910 Surrey Ave.]
1 c0002 92 N. Bishop Ave. []
2 c0003 9910 Golden Star Ave. [9910 Golden Star Ave.]
3 c0003 102 Dunbar St. []
4 c0004 17 West Livingston Court []
  1. Write a Pandas program to extract valid date (format: mm-dd-yyyy) from a given column of a given DataFrame.
df36 = pd.DataFrame({
    'company_code': ['Abcd','EFGF', 'zefsalf', 'sdfslew', 'zekfsdf'],
    'date_of_sale': ['12/05/2002','16/02/1999','05/09/1998','12/02/2022','15/09/1997'],
    'sale_amount': [12348.5, 233331.2, 22.5, 2566552.0, 23.0]
})
df36['valid_dates']=df36.date_of_sale.apply(lambda x:re.findall(r'(1[0-2]|0[1-9])/(3[0-1]|[1-2][0-9]|0\d)/(\d{4})',x))
df36
company_code date_of_sale sale_amount valid_dates
0 Abcd 12/05/2002 12348.5 [(12, 05, 2002)]
1 EFGF 16/02/1999 233331.2 []
2 zefsalf 05/09/1998 22.5 [(05, 09, 1998)]
3 sdfslew 12/02/2022 2566552.0 [(12, 02, 2022)]
4 zekfsdf 15/09/1997 23.0 []
  1. Write a Pandas program to extract only words from a given column of a given DataFrame.
df37 = pd.DataFrame({
    'company_code': ['Abcd','EFGF', 'zefsalf', 'sdfslew', 'zekfsdf'],
    'date_of_sale': ['12/05/2002','16/02/1999','05/09/1998','12/02/2022','15/09/1997'],
    'address': ['9910 Surrey Ave.','92 N. Bishop Ave.','9910 Golden Star Ave.', '102 Dunbar St.', '17 West Livingston Court']
})
re.findall(r'\S+[a-zA-Z]+\S+','92 N. Bishop Ave.')
['Bishop', 'Ave.']
df37['only_words']=df37.address.apply(lambda x:' '.join(re.findall(r'\b[^\d\W]+\b',x)))
df37
company_code date_of_sale address only_words
0 Abcd 12/05/2002 9910 Surrey Ave. Surrey Ave
1 EFGF 16/02/1999 92 N. Bishop Ave. N Bishop Ave
2 zefsalf 05/09/1998 9910 Golden Star Ave. Golden Star Ave
3 sdfslew 12/02/2022 102 Dunbar St. Dunbar St
4 zekfsdf 15/09/1997 17 West Livingston Court West Livingston Court
  1. Write a Pandas program to extract the sentences where a specific word is present in a given column of a given DataFrame.
df38 = pd.DataFrame({
    'company_code': ['Abcd','EFGF', 'zefsalf', 'sdfslew', 'zekfsdf'],
    'date_of_sale': ['12/05/2002','16/02/1999','05/09/1998','12/02/2022','15/09/1997'],
    'address': ['9910 Surrey Avenue','92 N. Bishop Avenue','9910 Golden Star Avenue', '102 Dunbar St.', '17 West Livingston Court']
})
word='Avenue'
df38['Ext_Sent']=df38.address.apply(lambda x:re.findall(r'([^.]*'+word+'[^.]*)',x))
df38
company_code date_of_sale address Ext_Sent
0 Abcd 12/05/2002 9910 Surrey Avenue [9910 Surrey Avenue]
1 EFGF 16/02/1999 92 N. Bishop Avenue [ Bishop Avenue]
2 zefsalf 05/09/1998 9910 Golden Star Avenue [9910 Golden Star Avenue]
3 sdfslew 12/02/2022 102 Dunbar St. []
4 zekfsdf 15/09/1997 17 West Livingston Court []
  1. Write a Pandas program to extract the unique sentences from a given column of a given DataFrame.
df39 = pd.DataFrame({
    'company_code': ['Abcd','EFGF', 'zefsalf', 'sdfslew', 'zekfsdf'],
    'date_of_sale': ['12/05/2002','16/02/1999','05/09/1998','12/02/2022','15/09/1997'],
    'address': ['9910 Surrey Avenue\n9910 Surrey Avenue','92 N. Bishop Avenue','9910 Golden Star Avenue', '102 Dunbar St.\n102 Dunbar St.', '17 West Livingston Court']
})
df39
company_code date_of_sale address
0 Abcd 12/05/2002 9910 Surrey Avenue\n9910 Surrey Avenue
1 EFGF 16/02/1999 92 N. Bishop Avenue
2 zefsalf 05/09/1998 9910 Golden Star Avenue
3 sdfslew 12/02/2022 102 Dunbar St.\n102 Dunbar St.
4 zekfsdf 15/09/1997 17 West Livingston Court
df39['unique_Statements']=df39.address.apply(lambda x:re.findall(r'(?sm)(^[^\r\n]+$)(?!.*^\1$)', x))
df39
company_code date_of_sale address unique_Statements
0 Abcd 12/05/2002 9910 Surrey Avenue\n9910 Surrey Avenue [9910 Surrey Avenue]
1 EFGF 16/02/1999 92 N. Bishop Avenue [92 N. Bishop Avenue]
2 zefsalf 05/09/1998 9910 Golden Star Avenue [9910 Golden Star Avenue]
3 sdfslew 12/02/2022 102 Dunbar St.\n102 Dunbar St. [102 Dunbar St.]
4 zekfsdf 15/09/1997 17 West Livingston Court [17 West Livingston Court]
re.findall(r'(?sm)(^[^\r\n]+$)(?!.*^\1$)', '9910 Surrey Avenue\n9910 Surrey Avenue')
['9910 Surrey Avenue']

  1. Write a Pandas program to extract words starting with capital words from a given column of a given DataFrame
df40 = pd.DataFrame({
    'company_code': ['Abcd','EFGF', 'zefsalf', 'sdfslew', 'zekfsdf'],
    'date_of_sale': ['12/05/2002','16/02/1999','05/09/1998','12/02/2022','15/09/1997'],
    'address': ['9910 Surrey Avenue','92 N. Bishop Avenue','9910 Golden Star Avenue', '102 Dunbar St.', '17 West Livingston Court']
})
df40['star_Capital']=df40.address.apply(lambda x:",".join(re.findall(r'\b[A-Z]+\S+',x)))
df40
company_code date_of_sale address star_Capital
0 Abcd 12/05/2002 9910 Surrey Avenue Surrey,Avenue
1 EFGF 16/02/1999 92 N. Bishop Avenue N.,Bishop,Avenue
2 zefsalf 05/09/1998 9910 Golden Star Avenue Golden,Star,Avenue
3 sdfslew 12/02/2022 102 Dunbar St. Dunbar,St.
4 zekfsdf 15/09/1997 17 West Livingston Court West,Livingston,Court
  1. Write a Pandas program to remove the html tags within the specified column of a given DataFrame.
df41 = pd.DataFrame({
    'company_code': ['Abcd','EFGF', 'zefsalf', 'sdfslew', 'zekfsdf'],
    'date_of_sale': ['12/05/2002','16/02/1999','05/09/1998','12/02/2022','15/09/1997'],
    'address': ['9910 Surrey <b>Avenue</b>','92 N. Bishop Avenue','9910 <br>Golden Star Avenue', '102 Dunbar <i></i>St.', '17 West Livingston Court']
})
re.sub('<.*?>','','9910 Surrey <b>Avenue</b>')
'9910 Surrey Avenue'
df41['removed_html']=df41.address.apply(lambda x:re.sub(r'<.*?>','',x))
df41
company_code date_of_sale address removed_html
0 Abcd 12/05/2002 9910 Surrey <b>Avenue</b> 9910 Surrey Avenue
1 EFGF 16/02/1999 92 N. Bishop Avenue 92 N. Bishop Avenue
2 zefsalf 05/09/1998 9910 <br>Golden Star Avenue 9910 Golden Star Avenue
3 sdfslew 12/02/2022 102 Dunbar <i></i>St. 102 Dunbar St.
4 zekfsdf 15/09/1997 17 West Livingston Court 17 West Livingston Court