Python Pandas - String and Regular Expression
Exercise - Pandas String and Regular Expression
import pandas as pd
import numpy as np
import re
- Write a Pandas program to convert all the string values to upper, lower cases in a given pandas series. Also find the length of the string values.
s1 = pd.Series(['X', 'Y', 'Z', 'Aaba', 'Baca', np.nan, 'CABA', None, 'bird', 'horse', 'dog'])
s1.str.upper()
0 X
1 Y
2 Z
3 AABA
4 BACA
5 NaN
6 CABA
7 None
8 BIRD
9 HORSE
10 DOG
dtype: object
s1.str.lower()
0 x
1 y
2 z
3 aaba
4 baca
5 NaN
6 caba
7 None
8 bird
9 horse
10 dog
dtype: object
s1.str.len()
0 1.0
1 1.0
2 1.0
3 4.0
4 4.0
5 NaN
6 4.0
7 NaN
8 4.0
9 5.0
10 3.0
dtype: float64
- Write a Pandas program to remove whitespaces, left sided whitespaces and right sided whitespaces of the string values of a given pandas series.
s2 = pd.Index([' Green', 'Black ', ' Red ', 'White', ' Pink '])
s2.str.strip()
Index(['Green', 'Black', 'Red', 'White', 'Pink'], dtype='object')
s2.str.lstrip()
Index(['Green', 'Black ', 'Red ', 'White', 'Pink '], dtype='object')
s2.str.rstrip()
Index([' Green', 'Black', ' Red', 'White', ' Pink'], dtype='object')
- Write a Pandas program to add leading zeros to the integer column in a pandas series and makes the length of the field to 8 digit.
df3=pd.DataFrame({'amount': [10, 250, 3000, 40000, 500000]})
#Solution 1
df3['amount'].apply(lambda x:'{0:0>8}'.format(x))
0 00000010
1 00000250
2 00003000
3 00040000
4 00500000
Name: amount, dtype: object
#Solution 2
df3['amount'].apply(lambda x:str(x).zfill(8))
0 00000010
1 00000250
2 00003000
3 00040000
4 00500000
Name: amount, dtype: object
- Write a Pandas program to add leading zeros to the character column in a pandas series and makes the length of the field to 8 digit.
df4 = pd.DataFrame({'amount': ['10', '250', '3000', '40000', '500000']})
#solution 1
df3.amount.apply(lambda x:'{0:0>8}'.format(x))
0 00000010
1 00000250
2 00003000
3 00040000
4 00500000
Name: amount, dtype: object
#Solution 2
df4.amount.apply(lambda x:x.zfill(8))
0 00000010
1 00000250
2 00003000
3 00040000
4 00500000
Name: amount, dtype: object
- Write a Pandas program to capitalize all the string values of specified columns of a given DataFrame.
df5 = pd.DataFrame({
'name': ['alberto','gino','ryan', 'Eesha', 'syed'],
'date_of_birth ': ['17/05/2002','16/02/1999','25/09/1998','11/05/2002','15/09/1997'],
'age': [18.5, 21.2, 22.5, 22, 23]
})
df5
name | date_of_birth | age | |
---|---|---|---|
0 | alberto | 17/05/2002 | 18.5 |
1 | gino | 16/02/1999 | 21.2 |
2 | ryan | 25/09/1998 | 22.5 |
3 | Eesha | 11/05/2002 | 22.0 |
4 | syed | 15/09/1997 | 23.0 |
df5.name.str.capitalize()
0 Alberto
1 Gino
2 Ryan
3 Eesha
4 Syed
Name: name, dtype: object
- Write a Pandas program to count of occurrence of a specified substring in a DataFrame column.
df6 = pd.DataFrame({
'name_code': ['c001','c002','c022', 'c2002', 'c2222'],
'date_of_birth ': ['12/05/2002','16/02/1999','25/09/1998','12/02/2022','15/09/1997'],
'age': [18.5, 21.2, 22.5, 22, 23]
})
df6
name_code | date_of_birth | age | |
---|---|---|---|
0 | c001 | 12/05/2002 | 18.5 |
1 | c002 | 16/02/1999 | 21.2 |
2 | c022 | 25/09/1998 | 22.5 |
3 | c2002 | 12/02/2022 | 22.0 |
4 | c2222 | 15/09/1997 | 23.0 |
print("\nCount occurrence of 2 in name_code column:")
df6['count']=df6.name_code.apply(lambda x:x.count('2'))
df6
Count occurrence of 2 in name_code column:
name_code | date_of_birth | age | count | |
---|---|---|---|---|
0 | c001 | 12/05/2002 | 18.5 | 0 |
1 | c002 | 16/02/1999 | 21.2 | 1 |
2 | c022 | 25/09/1998 | 22.5 | 2 |
3 | c2002 | 12/02/2022 | 22.0 | 2 |
4 | c2222 | 15/09/1997 | 23.0 | 4 |
- Write a Pandas program to find the index of a given substring of a DataFrame column.
df7 = pd.DataFrame({
'name_code': ['c001','c002','c022', 'c2002', 'c2222'],
'date_of_birth ': ['12/05/2002','16/02/1999','25/09/1998','12/02/2022','15/09/1997'],
'age': [18.5, 21.2, 22.5, 22, 23]
})
print("\nCount occurrence of 22 in name_code column:")
df7['Occur_22']=df7.name_code.apply(lambda x: 'Not Available' if x.find('22')==-1 else x.find('22'))
#df7['Occur_22'] = list(map(lambda x: x.find('22'), df7['name_code']))
df7
Count occurrence of 22 in name_code column:
name_code | date_of_birth | age | Occur_22 | |
---|---|---|---|---|
0 | c001 | 12/05/2002 | 18.5 | Not Available |
1 | c002 | 16/02/1999 | 21.2 | Not Available |
2 | c022 | 25/09/1998 | 22.5 | 2 |
3 | c2002 | 12/02/2022 | 22.0 | Not Available |
4 | c2222 | 15/09/1997 | 23.0 | 1 |
- Write a Pandas program to find the index of a substring of DataFrame within provided index boundary (beginning and end position).
df8 = pd.DataFrame({
'name_code': ['c0001','1000c','b00c2', 'b2c02', 'c2222'],
'date_of_birth ': ['12/05/2002','16/02/1999','25/09/1998','12/02/2022','15/09/1997'],
'age': [18.5, 21.2, 22.5, 22, 23]
})
df8['occur_c_within_bound']=df8.name_code.apply(lambda x:x.find('c',0,4)) # 0 is staring index, 4 is end position
df8
name_code | date_of_birth | age | occur_c_within_bound | |
---|---|---|---|---|
0 | c0001 | 12/05/2002 | 18.5 | 0 |
1 | 1000c | 16/02/1999 | 21.2 | -1 |
2 | b00c2 | 25/09/1998 | 22.5 | 3 |
3 | b2c02 | 12/02/2022 | 22.0 | 2 |
4 | c2222 | 15/09/1997 | 23.0 | 0 |
- Write a Pandas program to check whether alpha numeric values present in a given column of a DataFrame.
df9 = pd.DataFrame({
'name_code': ['Company','Company a001','Company 123', '1234', 'Company12'],
'date_of_birth ': ['12/05/2002','16/02/1999','25/09/1998','12/02/2022','15/09/1997'],
'age': [18.5, 21.2, 22.5, 22, 23]
})
df9['AphaNum_Flag']=df9.name_code.apply(lambda x:x.isalnum())
df9
name_code | date_of_birth | age | AphaNum_Flag | |
---|---|---|---|---|
0 | Company | 12/05/2002 | 18.5 | True |
1 | Company a001 | 16/02/1999 | 21.2 | False |
2 | Company 123 | 25/09/1998 | 22.5 | False |
3 | 1234 | 12/02/2022 | 22.0 | True |
4 | Company12 | 15/09/1997 | 23.0 | True |
- Write a Pandas program to check whether only alphabetic values present in a given column of a DataFrame.
df10 = pd.DataFrame({
'company_code': ['Company','Company a001','Company 123', 'abcd', 'Company 12'],
'date_of_sale ': ['12/05/2002','16/02/1999','25/09/1998','12/02/2022','15/09/1997'],
'sale_amount': [12348.5, 233331.2, 22.5, 2566552.0, 23.0]})
df10['Alphabetic_flag']=df10.company_code.apply(lambda x: x.isalpha())
df10
company_code | date_of_sale | sale_amount | Alphabetic_flag | |
---|---|---|---|---|
0 | Company | 12/05/2002 | 12348.5 | True |
1 | Company a001 | 16/02/1999 | 233331.2 | False |
2 | Company 123 | 25/09/1998 | 22.5 | False |
3 | abcd | 12/02/2022 | 2566552.0 | True |
4 | Company 12 | 15/09/1997 | 23.0 | False |
- Write a Pandas program to check whether only numeric values present in a given column of a DataFrame
df11 = pd.DataFrame({
'company_code': ['Company','Company a001', '2055', 'abcd', '123345'],
'date_of_sale ': ['12/05/2002','16/02/1999','25/09/1998','12/02/2022','15/09/1997'],
'sale_amount': [12348.5, 233331.2, 22.5, 2566552.0, 23.0]})
df11['Numeric_flag']=df11.company_code.apply(lambda x:x.isdigit())
df11
company_code | date_of_sale | sale_amount | Numeric_flag | |
---|---|---|---|---|
0 | Company | 12/05/2002 | 12348.5 | False |
1 | Company a001 | 16/02/1999 | 233331.2 | False |
2 | 2055 | 25/09/1998 | 22.5 | True |
3 | abcd | 12/02/2022 | 2566552.0 | False |
4 | 123345 | 15/09/1997 | 23.0 | True |
- Write a Pandas program to check whether only lower case or upper case is present in a given column of a DataFrame.
df12 = pd.DataFrame({
'company_code': ['ABCD','EFGF', 'hhhh', 'abcd', 'EAWQaaa'],
'date_of_sale ': ['12/05/2002','16/02/1999','25/09/1998','12/02/2022','15/09/1997'],
'sale_amount': [12348.5, 233331.2, 22.5, 2566552.0, 23.0]})
df12['islower']=df12.company_code.apply(lambda x:(x.islower()))
df12['isupper']=df12.company_code.apply(lambda x:(x.isupper()))
df12
company_code | date_of_sale | sale_amount | islower | isupper | |
---|---|---|---|---|---|
0 | ABCD | 12/05/2002 | 12348.5 | False | True |
1 | EFGF | 16/02/1999 | 233331.2 | False | True |
2 | hhhh | 25/09/1998 | 22.5 | True | False |
3 | abcd | 12/02/2022 | 2566552.0 | True | False |
4 | EAWQaaa | 15/09/1997 | 23.0 | False | False |
- Write a Pandas program to check whether only proper case or title case is present in a given column of a DataFrame.
df13 = pd.DataFrame({
'company_code': ['ABCD','EFGF', 'Hhhh', 'abcd', 'EAWQaaa'],
'date_of_sale ': ['12/05/2002','16/02/1999','25/09/1998','12/02/2022','15/09/1997'],
'sale_amount': [12348.5, 233331.2, 22.5, 2566552.0, 23.0]})
df13['istitle']=df13.company_code.apply(lambda x:x.istitle())
df13
company_code | date_of_sale | sale_amount | istitle | |
---|---|---|---|---|
0 | ABCD | 12/05/2002 | 12348.5 | False |
1 | EFGF | 16/02/1999 | 233331.2 | False |
2 | Hhhh | 25/09/1998 | 22.5 | True |
3 | abcd | 12/02/2022 | 2566552.0 | False |
4 | EAWQaaa | 15/09/1997 | 23.0 | False |
- Write a Pandas program to check whether only space is present in a given column of a DataFrame
df14 = pd.DataFrame({
'company_code': ['Abcd','EFGF ', ' ', 'abcd', ' '],
'date_of_sale ': ['12/05/2002','16/02/1999','25/09/1998','12/02/2022','15/09/1997'],
'sale_amount': [12348.5, 233331.2, 22.5, 2566552.0, 23.0]})
df14['isspace']=df14.company_code.apply(lambda x:x.isspace())
#Alternative
#df14['isspace']=df14.company_code.str.isspace()
df14
company_code | date_of_sale | sale_amount | isspace | |
---|---|---|---|---|
0 | Abcd | 12/05/2002 | 12348.5 | False |
1 | EFGF | 16/02/1999 | 233331.2 | False |
2 | 25/09/1998 | 22.5 | True | |
3 | abcd | 12/02/2022 | 2566552.0 | False |
4 | 15/09/1997 | 23.0 | True |
- Write a Pandas program to get the length of the string present of a given column in a DataFrame.
df15 = pd.DataFrame({
'company_code': ['Abcd','EFGF', 'skfsalf', 'sdfslew', 'safsdf'],
'date_of_sale ': ['12/05/2002','16/02/1999','25/09/1998','12/02/2022','15/09/1997'],
'sale_amount': [12348.5, 233331.2, 22.5, 2566552.0, 23.0]})
df15['length_company']=df15.company_code.str.len()
df15
company_code | date_of_sale | sale_amount | length_company | |
---|---|---|---|---|
0 | Abcd | 12/05/2002 | 12348.5 | 4 |
1 | EFGF | 16/02/1999 | 233331.2 | 4 |
2 | skfsalf | 25/09/1998 | 22.5 | 7 |
3 | sdfslew | 12/02/2022 | 2566552.0 | 7 |
4 | safsdf | 15/09/1997 | 23.0 | 6 |
- Write a Pandas program to get the length of the integer of a given column in a DataFrame.
df16 = pd.DataFrame({
'company_code': ['Abcd','EFGF', 'skfsalf', 'sdfslew', 'safsdf'],
'date_of_sale': ['12/05/2002','16/02/1999','25/09/1998','12/02/2022','15/09/1997'],
'sale_amount': [12348.5, 233331.2, 22.5, 2566552.0, 23.0]
})
df16['sale_amnt_len']=df16.sale_amount.astype('str').str.len()
df16
company_code | date_of_sale | sale_amount | sale_amnt_len | |
---|---|---|---|---|
0 | Abcd | 12/05/2002 | 12348.5 | 7 |
1 | EFGF | 16/02/1999 | 233331.2 | 8 |
2 | skfsalf | 25/09/1998 | 22.5 | 4 |
3 | sdfslew | 12/02/2022 | 2566552.0 | 9 |
4 | safsdf | 15/09/1997 | 23.0 | 4 |
- Write a Pandas program to check if a specified column starts with a specified string in a DataFrame.
df17 = pd.DataFrame({
'company_code': ['Abcd','EFGF', 'zefsalf', 'sdfslew', 'zekfsdf'],
'date_of_sale': ['12/05/2002','16/02/1999','25/09/1998','12/02/2022','15/09/1997'],
'sale_amount': [12348.5, 233331.2, 22.5, 2566552.0, 23.0]
})
df17['company_code_starts_with']=df17.company_code.str.startswith('ze')
df17
company_code | date_of_sale | sale_amount | company_code_starts_with | |
---|---|---|---|---|
0 | Abcd | 12/05/2002 | 12348.5 | False |
1 | EFGF | 16/02/1999 | 233331.2 | False |
2 | zefsalf | 25/09/1998 | 22.5 | True |
3 | sdfslew | 12/02/2022 | 2566552.0 | False |
4 | zekfsdf | 15/09/1997 | 23.0 | True |
- Write a Pandas program to swap the cases of a specified character column in a given DataFrame.
df18 = pd.DataFrame({
'company_code': ['Abcd','EFGF', 'zefsalf', 'sdfslew', 'zekfsdf'],
'date_of_sale': ['12/05/2002','16/02/1999','25/09/1998','12/02/2022','15/09/1997'],
'sale_amount': [12348.5, 233331.2, 22.5, 2566552.0, 23.0]
})
df18['swapcase_cc']=df18.company_code.str.swapcase()
df18
company_code | date_of_sale | sale_amount | swapcase_cc | |
---|---|---|---|---|
0 | Abcd | 12/05/2002 | 12348.5 | aBCD |
1 | EFGF | 16/02/1999 | 233331.2 | efgf |
2 | zefsalf | 25/09/1998 | 22.5 | ZEFSALF |
3 | sdfslew | 12/02/2022 | 2566552.0 | SDFSLEW |
4 | zekfsdf | 15/09/1997 | 23.0 | ZEKFSDF |
- Write a Pandas program to convert a specified character column in upper/lower cases in a given DataFrame.
df19 = pd.DataFrame({
'company_code': ['Abcd','EFGF', 'zefsalf', 'sdfslew', 'zekfsdf'],
'date_of_sale': ['12/05/2002','16/02/1999','25/09/1998','12/02/2022','15/09/1997'],
'sale_amount': [12348.5, 233331.2, 22.5, 2566552.0, 23.0]
})
df19['upper_cc']=df19.company_code.str.upper()
df19['lower_cc']=df19.company_code.str.lower()
df19
company_code | date_of_sale | sale_amount | upper_cc | lower_cc | |
---|---|---|---|---|---|
0 | Abcd | 12/05/2002 | 12348.5 | ABCD | abcd |
1 | EFGF | 16/02/1999 | 233331.2 | EFGF | efgf |
2 | zefsalf | 25/09/1998 | 22.5 | ZEFSALF | zefsalf |
3 | sdfslew | 12/02/2022 | 2566552.0 | SDFSLEW | sdfslew |
4 | zekfsdf | 15/09/1997 | 23.0 | ZEKFSDF | zekfsdf |
- Write a Pandas program to convert a specified character column in title case in a given DataFrame.
df20 = pd.DataFrame({
'company_code': ['Abcd','EFGF', 'zefsalf', 'sdfslew', 'zekfsdf'],
'date_of_sale': ['12/05/2002','16/02/1999','25/09/1998','12/02/2022','15/09/1997'],
'sale_amount': [12348.5, 233331.2, 22.5, 2566552.0, 23.0]
})
df20['title_cc']=df20.company_code.str.title()
df20
company_code | date_of_sale | sale_amount | title_cc | |
---|---|---|---|---|
0 | Abcd | 12/05/2002 | 12348.5 | Abcd |
1 | EFGF | 16/02/1999 | 233331.2 | Efgf |
2 | zefsalf | 25/09/1998 | 22.5 | Zefsalf |
3 | sdfslew | 12/02/2022 | 2566552.0 | Sdfslew |
4 | zekfsdf | 15/09/1997 | 23.0 | Zekfsdf |
- Write a Pandas program to replace arbitrary values with other values in a given DataFrame.
df21 = pd.DataFrame({
'company_code': ['A','B', 'C', 'D', 'A'],
'date_of_sale': ['12/05/2002','16/02/1999','25/09/1998','12/02/2022','15/09/1997'],
'sale_amount': [12348.5, 233331.2, 22.5, 2566552.0, 23.0]
})
#replacing A with C
df21.replace('A','C')
company_code | date_of_sale | sale_amount | |
---|---|---|---|
0 | C | 12/05/2002 | 12348.5 |
1 | B | 16/02/1999 | 233331.2 |
2 | C | 25/09/1998 | 22.5 |
3 | D | 12/02/2022 | 2566552.0 |
4 | C | 15/09/1997 | 23.0 |
- Write a Pandas program to replace more than one value with other values in a given DataFrame.
df22 = pd.DataFrame({
'company_code': ['A','B', 'C', 'D', 'A'],
'date_of_sale': ['12/05/2002','16/02/1999','25/09/1998','12/02/2022','15/09/1997'],
'sale_amount': [12348.5, 233331.2, 22.5, 2566552.0, 23.0]
})
df22.replace(['A','C'],['X','Y'])
company_code | date_of_sale | sale_amount | |
---|---|---|---|
0 | X | 12/05/2002 | 12348.5 |
1 | B | 16/02/1999 | 233331.2 |
2 | Y | 25/09/1998 | 22.5 |
3 | D | 12/02/2022 | 2566552.0 |
4 | X | 15/09/1997 | 23.0 |
- Write a Pandas program to split a string of a column of a given DataFrame into multiple columns.
df23 = pd.DataFrame({
'name': ['Alberto Franco','Gino Ann Mcneill','Ryan Parkes', 'Eesha Artur Hinton', 'Syed Wharton'],
'date_of_birth ': ['17/05/2002','16/02/1999','25/09/1998','11/05/2002','15/09/1997'],
'age': [18.5, 21.2, 22.5, 22, 23]
})
df23['First Middle Last'.split()]=df23.name.str.split(expand=True)
df23
name | date_of_birth | age | First | Middle | Last | |
---|---|---|---|---|---|---|
0 | Alberto Franco | 17/05/2002 | 18.5 | Alberto | Franco | None |
1 | Gino Ann Mcneill | 16/02/1999 | 21.2 | Gino | Ann | Mcneill |
2 | Ryan Parkes | 25/09/1998 | 22.5 | Ryan | Parkes | None |
3 | Eesha Artur Hinton | 11/05/2002 | 22.0 | Eesha | Artur | Hinton |
4 | Syed Wharton | 15/09/1997 | 23.0 | Syed | Wharton | None |
- Write a Pandas program to extract email from a specified column of string type of a given DataFrame
df24 = pd.DataFrame({
'name_email': ['Alberto Franco af@gmail.com','Gino Mcneill gm@yahoo.com','Ryan Parkes rp@abc.io', 'Eesha Hinton', 'Gino Mcneill gm@github.com']
})
df24['email']=df24.name_email.apply(lambda x:re.findall('\S+@+\S+',str(x)))
df24
name_email | ||
---|---|---|
0 | Alberto Franco af@gmail.com | [af@gmail.com] |
1 | Gino Mcneill gm@yahoo.com | [gm@yahoo.com] |
2 | Ryan Parkes rp@abc.io | [rp@abc.io] |
3 | Eesha Hinton | [] |
4 | Gino Mcneill gm@github.com | [gm@github.com] |
df24['email2']=df24.name_email.apply(lambda x:','.join(re.findall(r'[\w\.-]+@[\w\.-]+',str(x))))
df24
name_email | email2 | ||
---|---|---|---|
0 | Alberto Franco af@gmail.com | [af@gmail.com] | af@gmail.com |
1 | Gino Mcneill gm@yahoo.com | [gm@yahoo.com] | gm@yahoo.com |
2 | Ryan Parkes rp@abc.io | [rp@abc.io] | rp@abc.io |
3 | Eesha Hinton | [] | |
4 | Gino Mcneill gm@github.com | [gm@github.com] | gm@github.com |
- Write a Pandas program to extract hash attached word from twitter text from the specified column of a given DataFrame.
df25 = pd.DataFrame({
'tweets': ['#Obama says goodbye','Retweets for #cash','A political endorsement in #Indonesia', '1 dog = many #retweets', 'Just a simple #egg']
})
df25['hashtags']=df25.tweets.apply(lambda x: ' '.join(re.findall(r'(?<=#)\w+',x)))
df25
tweets | hashtags | |
---|---|---|
0 | #Obama says goodbye | Obama |
1 | Retweets for #cash | cash |
2 | A political endorsement in #Indonesia | Indonesia |
3 | 1 dog = many #retweets | retweets |
4 | Just a simple #egg | egg |
- Write a Pandas program to extract word mention someone in tweets using @ from the specified column of a given DataFrame.
df26 = pd.DataFrame({
'tweets': ['@Obama says goodbye','Retweets for @cash','A political endorsement in @Indonesia', '1 dog = many #retweets', 'Just a simple #egg']
})
df26['at_words']=df26.tweets.apply(lambda x:' '.join(re.findall(r'(?<=@)\w+',x)))
df26
tweets | at_words | |
---|---|---|
0 | @Obama says goodbye | Obama |
1 | Retweets for @cash | cash |
2 | A political endorsement in @Indonesia | Indonesia |
3 | 1 dog = many #retweets | |
4 | Just a simple #egg |
- Write a Pandas program to extract only number from the specified column of a given DataFrame.
df27 = pd.DataFrame({
'company_code': ['c0001','c0002','c0003', 'c0003', 'c0004'],
'address': ['7277 Surrey Ave.','920 N. Bishop Ave.','9910 Golden Star St.', '25 Dunbar St.', '17 West Livingston Court']
})
df27['numeric_in_address']=df27.address.apply(lambda x:','.join(re.findall(r'[0-9]+',x)))
df27
company_code | address | numeric_in_address | |
---|---|---|---|
0 | c0001 | 7277 Surrey Ave. | 7277 |
1 | c0002 | 920 N. Bishop Ave. | 920 |
2 | c0003 | 9910 Golden Star St. | 9910 |
3 | c0003 | 25 Dunbar St. | 25 |
4 | c0004 | 17 West Livingston Court | 17 |
- Write a Pandas program to extract only phone number from the specified column of a given DataFrame.
df28 = pd.DataFrame({
'company_code': ['c0001','c0002','c0003', 'c0003', 'c0004'],
'company_phone_no': ['Company1-Phone no. 4695168357','Company2-Phone no. 8088729013','Company3-Phone no. 6204658086', 'Company4-Phone no. 5159530096', 'Company5-Phone no. 9037952371']
})
re.findall(r'\b\d{10}\b',df28.company_phone_no[1])
['8088729013']
df28['phone_number']=df28.company_phone_no.apply(lambda x:','.join(re.findall(r'\b\d{10}\b',x)))
df28
company_code | company_phone_no | phone_number | |
---|---|---|---|
0 | c0001 | Company1-Phone no. 4695168357 | 4695168357 |
1 | c0002 | Company2-Phone no. 8088729013 | 8088729013 |
2 | c0003 | Company3-Phone no. 6204658086 | 6204658086 |
3 | c0003 | Company4-Phone no. 5159530096 | 5159530096 |
4 | c0004 | Company5-Phone no. 9037952371 | 9037952371 |
- Write a Pandas program to extract year between 1800 to 2200 from the specified column of a given DataFrame.
df29 = pd.DataFrame({
'company_code': ['c0001','c0002','c0003', 'c0003', 'c0004'],
'year': ['year 1800','year 1700','year 2300', 'year 1900', 'year 2200']
})
df29.year.str.split(expand=True)
0 | 1 | |
---|---|---|
0 | year | 1800 |
1 | year | 1700 |
2 | year | 2300 |
3 | year | 1900 |
4 | year | 2200 |
re.findall(r'\b([1][8-9]+\d{02}|[2][1]\d{2}|2200)\b','year 1800 year 1700 year 2300 year 1900 year 2200 year 1711')
['1800', '1900', '2200']
df29['year_btw']=df29.year.apply(lambda x:','.join(re.findall(r'\b([1][8-9]+\d{02}|[2][1]\d{2}|2200)\b',x)))
df29
company_code | year | year_btw | |
---|---|---|---|
0 | c0001 | year 1800 | 1800 |
1 | c0002 | year 1700 | |
2 | c0003 | year 2300 | |
3 | c0003 | year 1900 | 1900 |
4 | c0004 | year 2200 | 2200 |
- Write a Pandas program to extract only non alphanumeric characters from the specified column of a given DataFrame.
df30 = pd.DataFrame({
'company_code': ['c0001#','c00@0^2','$c0003', 'c0003', '&c0004'],
'year': ['year 1800','year 1700','year 2300', 'year 1900', 'year 2200']
})
re.findall(r'[^A-Za-z0-9]','c00@0^2')
['@', '^']
df30['special_char']=df30.company_code.apply(lambda x:','.join(re.findall(r'[^a-z0-9]',x)))
df30
company_code | year | special_char | |
---|---|---|---|
0 | c0001# | year 1800 | # |
1 | c00@0^2 | year 1700 | @,^ |
2 | $c0003 | year 2300 | $ |
3 | c0003 | year 1900 | |
4 | &c0004 | year 2200 | & |
- Write a Pandas program to extract only punctuations from the specified column of a given DataFrame.
df31 = pd.DataFrame({
'company_code': ['c0001.','c000,2','c0003', 'c0003#', 'c0004,'],
'year': ['year 1800','year 1700','year 2300', 'year 1900', 'year 2200']
})
df31['puct']=df31.company_code.apply(lambda x:re.findall(r'[^\w\s]',x))
df31
company_code | year | puct | |
---|---|---|---|
0 | c0001. | year 1800 | [.] |
1 | c000,2 | year 1700 | [,] |
2 | c0003 | year 2300 | [] |
3 | c0003# | year 1900 | [#] |
4 | c0004, | year 2200 | [,] |
- Write a Pandas program to remove repetitive characters from the specified column of a given DataFrame.
df32 = pd.DataFrame({
'text_code': ['t0001.','t0002','t0003', 't0004'],
'text_lang': ['She livedd a long life.', 'How oold is your father?', 'What is tthe problem?','TThhis desk is used by Tom.']
})
def rep_char(str1):
tchr = str1.group(0)
if len(tchr) > 1:
return tchr[0:1] # can change the value here on repetition
def unique_char(rep, sent_text):
convert = re.sub(r'(\w)\1+', rep, sent_text)
return convert
df32['normal_text']=df32['text_lang'].apply(lambda x : unique_char(rep_char,x))
df32
text_code | text_lang | normal_text | |
---|---|---|---|
0 | t0001. | She livedd a long life. | She lived a long life. |
1 | t0002 | How oold is your father? | How old is your father? |
2 | t0003 | What is tthe problem? | What is the problem? |
3 | t0004 | TThhis desk is used by Tom. | This desk is used by Tom. |
- Write a Pandas program to extract numbers greater than 940 from the specified column of a given DataFrame.
df33 = pd.DataFrame({
'company_code': ['c0001','c0002','c0003', 'c0003', 'c0004'],
'address': ['7277 Surrey Ave.1111','920 N. Bishop Ave.','9910 Golden Star St.', '1025 Dunbar St.', '1700 West Livingston Court']
})
df33
company_code | address | |
---|---|---|
0 | c0001 | 7277 Surrey Ave.1111 |
1 | c0002 | 920 N. Bishop Ave. |
2 | c0003 | 9910 Golden Star St. |
3 | c0003 | 1025 Dunbar St. |
4 | c0004 | 1700 West Livingston Court |
df33['num_great']=df33.address.apply(lambda x:','.join(re.findall(r'([9][4-9]\d|[1-9]\d{3})',x)))
df33
company_code | address | num_great | |
---|---|---|---|
0 | c0001 | 7277 Surrey Ave.1111 | 7277,1111 |
1 | c0002 | 920 N. Bishop Ave. | |
2 | c0003 | 9910 Golden Star St. | 991 |
3 | c0003 | 1025 Dunbar St. | 1025 |
4 | c0004 | 1700 West Livingston Court | 1700 |
- Write a Pandas program to extract numbers less than 100 from the specified column of a given DataFrame.
df34 = pd.DataFrame({
'company_code': ['c0001','c0002','c0003', 'c0003', 'c0004'],
'address': ['72 Surrey Ave.11','92 N. Bishop Ave.','9910 Golden Star St.', '102 Dunbar St.', '17 West Livingston Court']
})
df34['num_less']=df34.address.apply(lambda x:re.findall(r'(\b\d{2}\b)',x))
df34
company_code | address | num_less | |
---|---|---|---|
0 | c0001 | 72 Surrey Ave.11 | [72, 11] |
1 | c0002 | 92 N. Bishop Ave. | [92] |
2 | c0003 | 9910 Golden Star St. | [] |
3 | c0003 | 102 Dunbar St. | [] |
4 | c0004 | 17 West Livingston Court | [17] |
- Write a Pandas program to check whether two given words present in a specified column of a given DataFrame.
df35 = pd.DataFrame({
'company_code': ['c0001','c0002','c0003', 'c0003', 'c0004'],
'address': ['9910 Surrey Ave.','92 N. Bishop Ave.','9910 Golden Star Ave.', '102 Dunbar St.', '17 West Livingston Court']
})
df35['check_words']=df35.address.apply(lambda x:re.findall(r'(?=.*Ave.)(?=.*9910).*',x))
df35
company_code | address | check_words | |
---|---|---|---|
0 | c0001 | 9910 Surrey Ave. | [9910 Surrey Ave.] |
1 | c0002 | 92 N. Bishop Ave. | [] |
2 | c0003 | 9910 Golden Star Ave. | [9910 Golden Star Ave.] |
3 | c0003 | 102 Dunbar St. | [] |
4 | c0004 | 17 West Livingston Court | [] |
- Write a Pandas program to extract valid date (format: mm-dd-yyyy) from a given column of a given DataFrame.
df36 = pd.DataFrame({
'company_code': ['Abcd','EFGF', 'zefsalf', 'sdfslew', 'zekfsdf'],
'date_of_sale': ['12/05/2002','16/02/1999','05/09/1998','12/02/2022','15/09/1997'],
'sale_amount': [12348.5, 233331.2, 22.5, 2566552.0, 23.0]
})
df36['valid_dates']=df36.date_of_sale.apply(lambda x:re.findall(r'(1[0-2]|0[1-9])/(3[0-1]|[1-2][0-9]|0\d)/(\d{4})',x))
df36
company_code | date_of_sale | sale_amount | valid_dates | |
---|---|---|---|---|
0 | Abcd | 12/05/2002 | 12348.5 | [(12, 05, 2002)] |
1 | EFGF | 16/02/1999 | 233331.2 | [] |
2 | zefsalf | 05/09/1998 | 22.5 | [(05, 09, 1998)] |
3 | sdfslew | 12/02/2022 | 2566552.0 | [(12, 02, 2022)] |
4 | zekfsdf | 15/09/1997 | 23.0 | [] |
- Write a Pandas program to extract only words from a given column of a given DataFrame.
df37 = pd.DataFrame({
'company_code': ['Abcd','EFGF', 'zefsalf', 'sdfslew', 'zekfsdf'],
'date_of_sale': ['12/05/2002','16/02/1999','05/09/1998','12/02/2022','15/09/1997'],
'address': ['9910 Surrey Ave.','92 N. Bishop Ave.','9910 Golden Star Ave.', '102 Dunbar St.', '17 West Livingston Court']
})
re.findall(r'\S+[a-zA-Z]+\S+','92 N. Bishop Ave.')
['Bishop', 'Ave.']
df37['only_words']=df37.address.apply(lambda x:' '.join(re.findall(r'\b[^\d\W]+\b',x)))
df37
company_code | date_of_sale | address | only_words | |
---|---|---|---|---|
0 | Abcd | 12/05/2002 | 9910 Surrey Ave. | Surrey Ave |
1 | EFGF | 16/02/1999 | 92 N. Bishop Ave. | N Bishop Ave |
2 | zefsalf | 05/09/1998 | 9910 Golden Star Ave. | Golden Star Ave |
3 | sdfslew | 12/02/2022 | 102 Dunbar St. | Dunbar St |
4 | zekfsdf | 15/09/1997 | 17 West Livingston Court | West Livingston Court |
- Write a Pandas program to extract the sentences where a specific word is present in a given column of a given DataFrame.
df38 = pd.DataFrame({
'company_code': ['Abcd','EFGF', 'zefsalf', 'sdfslew', 'zekfsdf'],
'date_of_sale': ['12/05/2002','16/02/1999','05/09/1998','12/02/2022','15/09/1997'],
'address': ['9910 Surrey Avenue','92 N. Bishop Avenue','9910 Golden Star Avenue', '102 Dunbar St.', '17 West Livingston Court']
})
word='Avenue'
df38['Ext_Sent']=df38.address.apply(lambda x:re.findall(r'([^.]*'+word+'[^.]*)',x))
df38
company_code | date_of_sale | address | Ext_Sent | |
---|---|---|---|---|
0 | Abcd | 12/05/2002 | 9910 Surrey Avenue | [9910 Surrey Avenue] |
1 | EFGF | 16/02/1999 | 92 N. Bishop Avenue | [ Bishop Avenue] |
2 | zefsalf | 05/09/1998 | 9910 Golden Star Avenue | [9910 Golden Star Avenue] |
3 | sdfslew | 12/02/2022 | 102 Dunbar St. | [] |
4 | zekfsdf | 15/09/1997 | 17 West Livingston Court | [] |
- Write a Pandas program to extract the unique sentences from a given column of a given DataFrame.
df39 = pd.DataFrame({
'company_code': ['Abcd','EFGF', 'zefsalf', 'sdfslew', 'zekfsdf'],
'date_of_sale': ['12/05/2002','16/02/1999','05/09/1998','12/02/2022','15/09/1997'],
'address': ['9910 Surrey Avenue\n9910 Surrey Avenue','92 N. Bishop Avenue','9910 Golden Star Avenue', '102 Dunbar St.\n102 Dunbar St.', '17 West Livingston Court']
})
df39
company_code | date_of_sale | address | |
---|---|---|---|
0 | Abcd | 12/05/2002 | 9910 Surrey Avenue\n9910 Surrey Avenue |
1 | EFGF | 16/02/1999 | 92 N. Bishop Avenue |
2 | zefsalf | 05/09/1998 | 9910 Golden Star Avenue |
3 | sdfslew | 12/02/2022 | 102 Dunbar St.\n102 Dunbar St. |
4 | zekfsdf | 15/09/1997 | 17 West Livingston Court |
df39['unique_Statements']=df39.address.apply(lambda x:re.findall(r'(?sm)(^[^\r\n]+$)(?!.*^\1$)', x))
df39
company_code | date_of_sale | address | unique_Statements | |
---|---|---|---|---|
0 | Abcd | 12/05/2002 | 9910 Surrey Avenue\n9910 Surrey Avenue | [9910 Surrey Avenue] |
1 | EFGF | 16/02/1999 | 92 N. Bishop Avenue | [92 N. Bishop Avenue] |
2 | zefsalf | 05/09/1998 | 9910 Golden Star Avenue | [9910 Golden Star Avenue] |
3 | sdfslew | 12/02/2022 | 102 Dunbar St.\n102 Dunbar St. | [102 Dunbar St.] |
4 | zekfsdf | 15/09/1997 | 17 West Livingston Court | [17 West Livingston Court] |
re.findall(r'(?sm)(^[^\r\n]+$)(?!.*^\1$)', '9910 Surrey Avenue\n9910 Surrey Avenue')
['9910 Surrey Avenue']
- Write a Pandas program to extract words starting with capital words from a given column of a given DataFrame
df40 = pd.DataFrame({
'company_code': ['Abcd','EFGF', 'zefsalf', 'sdfslew', 'zekfsdf'],
'date_of_sale': ['12/05/2002','16/02/1999','05/09/1998','12/02/2022','15/09/1997'],
'address': ['9910 Surrey Avenue','92 N. Bishop Avenue','9910 Golden Star Avenue', '102 Dunbar St.', '17 West Livingston Court']
})
df40['star_Capital']=df40.address.apply(lambda x:",".join(re.findall(r'\b[A-Z]+\S+',x)))
df40
company_code | date_of_sale | address | star_Capital | |
---|---|---|---|---|
0 | Abcd | 12/05/2002 | 9910 Surrey Avenue | Surrey,Avenue |
1 | EFGF | 16/02/1999 | 92 N. Bishop Avenue | N.,Bishop,Avenue |
2 | zefsalf | 05/09/1998 | 9910 Golden Star Avenue | Golden,Star,Avenue |
3 | sdfslew | 12/02/2022 | 102 Dunbar St. | Dunbar,St. |
4 | zekfsdf | 15/09/1997 | 17 West Livingston Court | West,Livingston,Court |
- Write a Pandas program to remove the html tags within the specified column of a given DataFrame.
df41 = pd.DataFrame({
'company_code': ['Abcd','EFGF', 'zefsalf', 'sdfslew', 'zekfsdf'],
'date_of_sale': ['12/05/2002','16/02/1999','05/09/1998','12/02/2022','15/09/1997'],
'address': ['9910 Surrey <b>Avenue</b>','92 N. Bishop Avenue','9910 <br>Golden Star Avenue', '102 Dunbar <i></i>St.', '17 West Livingston Court']
})
re.sub('<.*?>','','9910 Surrey <b>Avenue</b>')
'9910 Surrey Avenue'
df41['removed_html']=df41.address.apply(lambda x:re.sub(r'<.*?>','',x))
df41
company_code | date_of_sale | address | removed_html | |
---|---|---|---|---|
0 | Abcd | 12/05/2002 | 9910 Surrey <b>Avenue</b> | 9910 Surrey Avenue |
1 | EFGF | 16/02/1999 | 92 N. Bishop Avenue | 92 N. Bishop Avenue |
2 | zefsalf | 05/09/1998 | 9910 <br>Golden Star Avenue | 9910 Golden Star Avenue |
3 | sdfslew | 12/02/2022 | 102 Dunbar <i></i>St. | 102 Dunbar St. |
4 | zekfsdf | 15/09/1997 | 17 West Livingston Court | 17 West Livingston Court |