Python Pandas - String and Regular Expression

17 minute read

Exercise - Pandas String and Regular Expression

import pandas as pd
import numpy as np
import re

Write a Pandas program to convert all the string values to upper, lower cases in a given pandas series. Also find the length of the string values.

s1 = pd.Series(['X', 'Y', 'Z', 'Aaba', 'Baca', np.nan, 'CABA', None, 'bird', 'horse', 'dog'])

s1.str.upper()

       X
       Y
       Z
    AABA
    BACA
     NaN
    CABA
    None
    BIRD
   HORSE
    DOG
dtype: object

s1.str.lower()

       x
       y
       z
    aaba
    baca
     NaN
    caba
    None
    bird
   horse
    dog
dtype: object

s1.str.len()

   1.0
   1.0
   1.0
   4.0
   4.0
   NaN
   4.0
   NaN
   4.0
   5.0
  3.0
dtype: float64

Write a Pandas program to remove whitespaces, left sided whitespaces and right sided whitespaces of the string values of a given pandas series.

s2 = pd.Index([' Green', 'Black ', ' Red ', 'White', ' Pink '])

s2.str.strip()

Index(['Green', 'Black', 'Red', 'White', 'Pink'], dtype='object')

s2.str.lstrip()

Index(['Green', 'Black ', 'Red ', 'White', 'Pink '], dtype='object')

s2.str.rstrip()

Index([' Green', 'Black', ' Red', 'White', ' Pink'], dtype='object')

Write a Pandas program to add leading zeros to the integer column in a pandas series and makes the length of the field to 8 digit.

df3=pd.DataFrame({'amount': [10, 250, 3000, 40000, 500000]})

#Solution 1
df3['amount'].apply(lambda x:'{0:0>8}'.format(x))

  00000010
  00000250
  00003000
  00040000
  00500000
Name: amount, dtype: object

#Solution 2
df3['amount'].apply(lambda x:str(x).zfill(8))

  00000010
  00000250
  00003000
  00040000
  00500000
Name: amount, dtype: object

Write a Pandas program to add leading zeros to the character column in a pandas series and makes the length of the field to 8 digit.

df4 = pd.DataFrame({'amount': ['10', '250', '3000', '40000', '500000']})

#solution 1
df3.amount.apply(lambda x:'{0:0>8}'.format(x))

  00000010
  00000250
  00003000
  00040000
  00500000
Name: amount, dtype: object

#Solution 2
df4.amount.apply(lambda x:x.zfill(8))

  00000010
  00000250
  00003000
  00040000
  00500000
Name: amount, dtype: object

Write a Pandas program to capitalize all the string values of specified columns of a given DataFrame.

df5 = pd.DataFrame({
    'name': ['alberto','gino','ryan', 'Eesha', 'syed'],
    'date_of_birth ': ['17/05/2002','16/02/1999','25/09/1998','11/05/2002','15/09/1997'],
    'age': [18.5, 21.2, 22.5, 22, 23]
})

df5

	name	date_of_birth	age
0	alberto	17/05/2002	18.5
1	gino	16/02/1999	21.2
2	ryan	25/09/1998	22.5
3	Eesha	11/05/2002	22.0
4	syed	15/09/1997	23.0

df5.name.str.capitalize()

  Alberto
     Gino
     Ryan
    Eesha
     Syed
Name: name, dtype: object

Write a Pandas program to count of occurrence of a specified substring in a DataFrame column.

df6 = pd.DataFrame({
    'name_code': ['c001','c002','c022', 'c2002', 'c2222'],
    'date_of_birth ': ['12/05/2002','16/02/1999','25/09/1998','12/02/2022','15/09/1997'],
    'age': [18.5, 21.2, 22.5, 22, 23]
})

df6

	name_code	date_of_birth	age
0	c001	12/05/2002	18.5
1	c002	16/02/1999	21.2
2	c022	25/09/1998	22.5
3	c2002	12/02/2022	22.0
4	c2222	15/09/1997	23.0

print("\nCount occurrence of 2 in name_code column:")

df6['count']=df6.name_code.apply(lambda x:x.count('2'))
df6

Count occurrence of 2 in name_code column:

	name_code	date_of_birth	age	count
0	c001	12/05/2002	18.5	0
1	c002	16/02/1999	21.2	1
2	c022	25/09/1998	22.5	2
3	c2002	12/02/2022	22.0	2
4	c2222	15/09/1997	23.0	4

Write a Pandas program to find the index of a given substring of a DataFrame column.

df7 = pd.DataFrame({
    'name_code': ['c001','c002','c022', 'c2002', 'c2222'],
    'date_of_birth ': ['12/05/2002','16/02/1999','25/09/1998','12/02/2022','15/09/1997'],
    'age': [18.5, 21.2, 22.5, 22, 23]
})

print("\nCount occurrence of 22 in name_code column:")
df7['Occur_22']=df7.name_code.apply(lambda x: 'Not Available' if x.find('22')==-1 else x.find('22'))
#df7['Occur_22'] = list(map(lambda x: x.find('22'), df7['name_code']))
df7

Count occurrence of 22 in name_code column:

	name_code	date_of_birth	age	Occur_22
0	c001	12/05/2002	18.5	Not Available
1	c002	16/02/1999	21.2	Not Available
2	c022	25/09/1998	22.5	2
3	c2002	12/02/2022	22.0	Not Available
4	c2222	15/09/1997	23.0	1

Write a Pandas program to find the index of a substring of DataFrame within provided index boundary (beginning and end position).

df8 = pd.DataFrame({
    'name_code': ['c0001','1000c','b00c2', 'b2c02', 'c2222'],
    'date_of_birth ': ['12/05/2002','16/02/1999','25/09/1998','12/02/2022','15/09/1997'],
    'age': [18.5, 21.2, 22.5, 22, 23]
})

df8['occur_c_within_bound']=df8.name_code.apply(lambda x:x.find('c',0,4)) # 0 is staring index, 4 is end position
df8

	name_code	date_of_birth	age	occur_c_within_bound
0	c0001	12/05/2002	18.5	0
1	1000c	16/02/1999	21.2	-1
2	b00c2	25/09/1998	22.5	3
3	b2c02	12/02/2022	22.0	2
4	c2222	15/09/1997	23.0	0

Write a Pandas program to check whether alpha numeric values present in a given column of a DataFrame.

df9 = pd.DataFrame({
    'name_code': ['Company','Company a001','Company 123', '1234', 'Company12'],
    'date_of_birth ': ['12/05/2002','16/02/1999','25/09/1998','12/02/2022','15/09/1997'],
    'age': [18.5, 21.2, 22.5, 22, 23]
})

df9['AphaNum_Flag']=df9.name_code.apply(lambda x:x.isalnum())
df9

	name_code	date_of_birth	age	AphaNum_Flag
0	Company	12/05/2002	18.5	True
1	Company a001	16/02/1999	21.2	False
2	Company 123	25/09/1998	22.5	False
3	1234	12/02/2022	22.0	True
4	Company12	15/09/1997	23.0	True

Write a Pandas program to check whether only alphabetic values present in a given column of a DataFrame.

df10 = pd.DataFrame({
    'company_code': ['Company','Company a001','Company 123', 'abcd', 'Company 12'],
    'date_of_sale ': ['12/05/2002','16/02/1999','25/09/1998','12/02/2022','15/09/1997'],
    'sale_amount': [12348.5, 233331.2, 22.5, 2566552.0, 23.0]})

df10['Alphabetic_flag']=df10.company_code.apply(lambda x: x.isalpha())
df10

	company_code	date_of_sale	sale_amount	Alphabetic_flag
0	Company	12/05/2002	12348.5	True
1	Company a001	16/02/1999	233331.2	False
2	Company 123	25/09/1998	22.5	False
3	abcd	12/02/2022	2566552.0	True
4	Company 12	15/09/1997	23.0	False

Write a Pandas program to check whether only numeric values present in a given column of a DataFrame

df11 = pd.DataFrame({
    'company_code': ['Company','Company a001', '2055', 'abcd', '123345'],
    'date_of_sale ': ['12/05/2002','16/02/1999','25/09/1998','12/02/2022','15/09/1997'],
    'sale_amount': [12348.5, 233331.2, 22.5, 2566552.0, 23.0]})

df11['Numeric_flag']=df11.company_code.apply(lambda x:x.isdigit())
df11

	company_code	date_of_sale	sale_amount	Numeric_flag
0	Company	12/05/2002	12348.5	False
1	Company a001	16/02/1999	233331.2	False
2	2055	25/09/1998	22.5	True
3	abcd	12/02/2022	2566552.0	False
4	123345	15/09/1997	23.0	True

Write a Pandas program to check whether only lower case or upper case is present in a given column of a DataFrame.

df12 = pd.DataFrame({
    'company_code': ['ABCD','EFGF', 'hhhh', 'abcd', 'EAWQaaa'],
    'date_of_sale ': ['12/05/2002','16/02/1999','25/09/1998','12/02/2022','15/09/1997'],
    'sale_amount': [12348.5, 233331.2, 22.5, 2566552.0, 23.0]})

df12['islower']=df12.company_code.apply(lambda x:(x.islower()))
df12['isupper']=df12.company_code.apply(lambda x:(x.isupper()))

df12

	company_code	date_of_sale	sale_amount	islower	isupper
0	ABCD	12/05/2002	12348.5	False	True
1	EFGF	16/02/1999	233331.2	False	True
2	hhhh	25/09/1998	22.5	True	False
3	abcd	12/02/2022	2566552.0	True	False
4	EAWQaaa	15/09/1997	23.0	False	False

Write a Pandas program to check whether only proper case or title case is present in a given column of a DataFrame.

df13 = pd.DataFrame({
    'company_code': ['ABCD','EFGF', 'Hhhh', 'abcd', 'EAWQaaa'],
    'date_of_sale ': ['12/05/2002','16/02/1999','25/09/1998','12/02/2022','15/09/1997'],
    'sale_amount': [12348.5, 233331.2, 22.5, 2566552.0, 23.0]})

df13['istitle']=df13.company_code.apply(lambda x:x.istitle())

df13

	company_code	date_of_sale	sale_amount	istitle
0	ABCD	12/05/2002	12348.5	False
1	EFGF	16/02/1999	233331.2	False
2	Hhhh	25/09/1998	22.5	True
3	abcd	12/02/2022	2566552.0	False
4	EAWQaaa	15/09/1997	23.0	False

Write a Pandas program to check whether only space is present in a given column of a DataFrame

df14 = pd.DataFrame({
    'company_code': ['Abcd','EFGF ', '  ', 'abcd', ' '],
    'date_of_sale ': ['12/05/2002','16/02/1999','25/09/1998','12/02/2022','15/09/1997'],
    'sale_amount': [12348.5, 233331.2, 22.5, 2566552.0, 23.0]})

df14['isspace']=df14.company_code.apply(lambda x:x.isspace())
#Alternative
#df14['isspace']=df14.company_code.str.isspace()

df14

	company_code	date_of_sale	sale_amount	isspace
0	Abcd	12/05/2002	12348.5	False
1	EFGF	16/02/1999	233331.2	False
2		25/09/1998	22.5	True
3	abcd	12/02/2022	2566552.0	False
4		15/09/1997	23.0	True

Write a Pandas program to get the length of the string present of a given column in a DataFrame.

df15 = pd.DataFrame({
    'company_code': ['Abcd','EFGF', 'skfsalf', 'sdfslew', 'safsdf'],
    'date_of_sale ': ['12/05/2002','16/02/1999','25/09/1998','12/02/2022','15/09/1997'],
    'sale_amount': [12348.5, 233331.2, 22.5, 2566552.0, 23.0]})

df15['length_company']=df15.company_code.str.len()

df15

	company_code	date_of_sale	sale_amount	length_company
0	Abcd	12/05/2002	12348.5	4
1	EFGF	16/02/1999	233331.2	4
2	skfsalf	25/09/1998	22.5	7
3	sdfslew	12/02/2022	2566552.0	7
4	safsdf	15/09/1997	23.0	6

Write a Pandas program to get the length of the integer of a given column in a DataFrame.

df16 = pd.DataFrame({
    'company_code': ['Abcd','EFGF', 'skfsalf', 'sdfslew', 'safsdf'],
    'date_of_sale': ['12/05/2002','16/02/1999','25/09/1998','12/02/2022','15/09/1997'],
    'sale_amount': [12348.5, 233331.2, 22.5, 2566552.0, 23.0]
})

df16['sale_amnt_len']=df16.sale_amount.astype('str').str.len()

df16

	company_code	date_of_sale	sale_amount	sale_amnt_len
0	Abcd	12/05/2002	12348.5	7
1	EFGF	16/02/1999	233331.2	8
2	skfsalf	25/09/1998	22.5	4
3	sdfslew	12/02/2022	2566552.0	9
4	safsdf	15/09/1997	23.0	4

Write a Pandas program to check if a specified column starts with a specified string in a DataFrame.

df17 = pd.DataFrame({
    'company_code': ['Abcd','EFGF', 'zefsalf', 'sdfslew', 'zekfsdf'],
    'date_of_sale': ['12/05/2002','16/02/1999','25/09/1998','12/02/2022','15/09/1997'],
    'sale_amount': [12348.5, 233331.2, 22.5, 2566552.0, 23.0]
})

df17['company_code_starts_with']=df17.company_code.str.startswith('ze')

df17

	company_code	date_of_sale	sale_amount	company_code_starts_with
0	Abcd	12/05/2002	12348.5	False
1	EFGF	16/02/1999	233331.2	False
2	zefsalf	25/09/1998	22.5	True
3	sdfslew	12/02/2022	2566552.0	False
4	zekfsdf	15/09/1997	23.0	True

Write a Pandas program to swap the cases of a specified character column in a given DataFrame.

df18 = pd.DataFrame({
    'company_code': ['Abcd','EFGF', 'zefsalf', 'sdfslew', 'zekfsdf'],
    'date_of_sale': ['12/05/2002','16/02/1999','25/09/1998','12/02/2022','15/09/1997'],
    'sale_amount': [12348.5, 233331.2, 22.5, 2566552.0, 23.0]
})

df18['swapcase_cc']=df18.company_code.str.swapcase()
df18

	company_code	date_of_sale	sale_amount	swapcase_cc
0	Abcd	12/05/2002	12348.5	aBCD
1	EFGF	16/02/1999	233331.2	efgf
2	zefsalf	25/09/1998	22.5	ZEFSALF
3	sdfslew	12/02/2022	2566552.0	SDFSLEW
4	zekfsdf	15/09/1997	23.0	ZEKFSDF

Write a Pandas program to convert a specified character column in upper/lower cases in a given DataFrame.

df19 = pd.DataFrame({
    'company_code': ['Abcd','EFGF', 'zefsalf', 'sdfslew', 'zekfsdf'],
    'date_of_sale': ['12/05/2002','16/02/1999','25/09/1998','12/02/2022','15/09/1997'],
    'sale_amount': [12348.5, 233331.2, 22.5, 2566552.0, 23.0]
})

df19['upper_cc']=df19.company_code.str.upper()
df19['lower_cc']=df19.company_code.str.lower()
df19

	company_code	date_of_sale	sale_amount	upper_cc	lower_cc
0	Abcd	12/05/2002	12348.5	ABCD	abcd
1	EFGF	16/02/1999	233331.2	EFGF	efgf
2	zefsalf	25/09/1998	22.5	ZEFSALF	zefsalf
3	sdfslew	12/02/2022	2566552.0	SDFSLEW	sdfslew
4	zekfsdf	15/09/1997	23.0	ZEKFSDF	zekfsdf

Write a Pandas program to convert a specified character column in title case in a given DataFrame.

df20 = pd.DataFrame({
    'company_code': ['Abcd','EFGF', 'zefsalf', 'sdfslew', 'zekfsdf'],
    'date_of_sale': ['12/05/2002','16/02/1999','25/09/1998','12/02/2022','15/09/1997'],
    'sale_amount': [12348.5, 233331.2, 22.5, 2566552.0, 23.0]
})

df20['title_cc']=df20.company_code.str.title()
df20

	company_code	date_of_sale	sale_amount	title_cc
0	Abcd	12/05/2002	12348.5	Abcd
1	EFGF	16/02/1999	233331.2	Efgf
2	zefsalf	25/09/1998	22.5	Zefsalf
3	sdfslew	12/02/2022	2566552.0	Sdfslew
4	zekfsdf	15/09/1997	23.0	Zekfsdf

Write a Pandas program to replace arbitrary values with other values in a given DataFrame.

df21 = pd.DataFrame({
    'company_code': ['A','B', 'C', 'D', 'A'],
    'date_of_sale': ['12/05/2002','16/02/1999','25/09/1998','12/02/2022','15/09/1997'],
    'sale_amount': [12348.5, 233331.2, 22.5, 2566552.0, 23.0]
})

#replacing A with C
df21.replace('A','C')

	company_code	date_of_sale	sale_amount
0	C	12/05/2002	12348.5
1	B	16/02/1999	233331.2
2	C	25/09/1998	22.5
3	D	12/02/2022	2566552.0
4	C	15/09/1997	23.0

Write a Pandas program to replace more than one value with other values in a given DataFrame.

df22 = pd.DataFrame({
    'company_code': ['A','B', 'C', 'D', 'A'],
    'date_of_sale': ['12/05/2002','16/02/1999','25/09/1998','12/02/2022','15/09/1997'],
    'sale_amount': [12348.5, 233331.2, 22.5, 2566552.0, 23.0]
})

df22.replace(['A','C'],['X','Y'])

	company_code	date_of_sale	sale_amount
0	X	12/05/2002	12348.5
1	B	16/02/1999	233331.2
2	Y	25/09/1998	22.5
3	D	12/02/2022	2566552.0
4	X	15/09/1997	23.0

Write a Pandas program to split a string of a column of a given DataFrame into multiple columns.

df23 = pd.DataFrame({
    'name': ['Alberto  Franco','Gino Ann Mcneill','Ryan  Parkes', 'Eesha Artur Hinton', 'Syed  Wharton'],
    'date_of_birth ': ['17/05/2002','16/02/1999','25/09/1998','11/05/2002','15/09/1997'],
    'age': [18.5, 21.2, 22.5, 22, 23]
})

df23['First Middle Last'.split()]=df23.name.str.split(expand=True)
df23

	name	date_of_birth	age	First	Middle	Last
0	Alberto Franco	17/05/2002	18.5	Alberto	Franco	None
1	Gino Ann Mcneill	16/02/1999	21.2	Gino	Ann	Mcneill
2	Ryan Parkes	25/09/1998	22.5	Ryan	Parkes	None
3	Eesha Artur Hinton	11/05/2002	22.0	Eesha	Artur	Hinton
4	Syed Wharton	15/09/1997	23.0	Syed	Wharton	None

Write a Pandas program to extract email from a specified column of string type of a given DataFrame

df24 = pd.DataFrame({
    'name_email': ['Alberto Franco af@gmail.com','Gino Mcneill gm@yahoo.com','Ryan Parkes rp@abc.io', 'Eesha Hinton', 'Gino Mcneill gm@github.com']
    })

df24['email']=df24.name_email.apply(lambda x:re.findall('\S+@+\S+',str(x)))
df24

	name_email	email
0	Alberto Franco af@gmail.com	[af@gmail.com]
1	Gino Mcneill gm@yahoo.com	[gm@yahoo.com]
2	Ryan Parkes rp@abc.io	[rp@abc.io]
3	Eesha Hinton	[]
4	Gino Mcneill gm@github.com	[gm@github.com]

df24['email2']=df24.name_email.apply(lambda x:','.join(re.findall(r'[\w\.-]+@[\w\.-]+',str(x))))
df24

	name_email	email	email2
0	Alberto Franco af@gmail.com	[af@gmail.com]	af@gmail.com
1	Gino Mcneill gm@yahoo.com	[gm@yahoo.com]	gm@yahoo.com
2	Ryan Parkes rp@abc.io	[rp@abc.io]	rp@abc.io
3	Eesha Hinton	[]
4	Gino Mcneill gm@github.com	[gm@github.com]	gm@github.com

Write a Pandas program to extract hash attached word from twitter text from the specified column of a given DataFrame.

df25 = pd.DataFrame({
    'tweets': ['#Obama says goodbye','Retweets for #cash','A political endorsement in #Indonesia', '1 dog = many #retweets', 'Just a simple #egg']
    })

df25['hashtags']=df25.tweets.apply(lambda x: ' '.join(re.findall(r'(?<=#)\w+',x)))
df25

	tweets	hashtags
0	#Obama says goodbye	Obama
1	Retweets for #cash	cash
2	A political endorsement in #Indonesia	Indonesia
3	1 dog = many #retweets	retweets
4	Just a simple #egg	egg

Write a Pandas program to extract word mention someone in tweets using @ from the specified column of a given DataFrame.

df26 = pd.DataFrame({
    'tweets': ['@Obama says goodbye','Retweets for @cash','A political endorsement in @Indonesia', '1 dog = many #retweets', 'Just a simple #egg']
    })

df26['at_words']=df26.tweets.apply(lambda x:' '.join(re.findall(r'(?<=@)\w+',x)))
df26

	tweets	at_words
0	@Obama says goodbye	Obama
1	Retweets for @cash	cash
2	A political endorsement in @Indonesia	Indonesia
3	1 dog = many #retweets
4	Just a simple #egg

Write a Pandas program to extract only number from the specified column of a given DataFrame.

df27 = pd.DataFrame({
    'company_code': ['c0001','c0002','c0003', 'c0003', 'c0004'],
    'address': ['7277 Surrey Ave.','920 N. Bishop Ave.','9910 Golden Star St.', '25 Dunbar St.', '17 West Livingston Court']
    })

df27['numeric_in_address']=df27.address.apply(lambda x:','.join(re.findall(r'[0-9]+',x)))
df27

	company_code	address	numeric_in_address
0	c0001	7277 Surrey Ave.	7277
1	c0002	920 N. Bishop Ave.	920
2	c0003	9910 Golden Star St.	9910
3	c0003	25 Dunbar St.	25
4	c0004	17 West Livingston Court	17

Write a Pandas program to extract only phone number from the specified column of a given DataFrame.

df28 = pd.DataFrame({
    'company_code': ['c0001','c0002','c0003', 'c0003', 'c0004'],
    'company_phone_no': ['Company1-Phone no. 4695168357','Company2-Phone no. 8088729013','Company3-Phone no. 6204658086', 'Company4-Phone no. 5159530096', 'Company5-Phone no. 9037952371']
    })

re.findall(r'\b\d{10}\b',df28.company_phone_no[1])

['8088729013']

df28['phone_number']=df28.company_phone_no.apply(lambda x:','.join(re.findall(r'\b\d{10}\b',x)))
df28

	company_code	company_phone_no	phone_number
0	c0001	Company1-Phone no. 4695168357	4695168357
1	c0002	Company2-Phone no. 8088729013	8088729013
2	c0003	Company3-Phone no. 6204658086	6204658086
3	c0003	Company4-Phone no. 5159530096	5159530096
4	c0004	Company5-Phone no. 9037952371	9037952371

Write a Pandas program to extract year between 1800 to 2200 from the specified column of a given DataFrame.

df29 = pd.DataFrame({
    'company_code': ['c0001','c0002','c0003', 'c0003', 'c0004'],
    'year': ['year 1800','year 1700','year 2300', 'year 1900', 'year 2200']
    })

df29.year.str.split(expand=True)

	0	1
0	year	1800
1	year	1700
2	year	2300
3	year	1900
4	year	2200

re.findall(r'\b([1][8-9]+\d{02}|[2][1]\d{2}|2200)\b','year 1800 year 1700 year 2300 year 1900 year 2200 year 1711')

['1800', '1900', '2200']

df29['year_btw']=df29.year.apply(lambda x:','.join(re.findall(r'\b([1][8-9]+\d{02}|[2][1]\d{2}|2200)\b',x)))
df29

	company_code	year	year_btw
0	c0001	year 1800	1800
1	c0002	year 1700
2	c0003	year 2300
3	c0003	year 1900	1900
4	c0004	year 2200	2200

Write a Pandas program to extract only non alphanumeric characters from the specified column of a given DataFrame.

df30 = pd.DataFrame({
    'company_code': ['c0001#','c00@0^2','$c0003', 'c0003', '&c0004'],
    'year': ['year 1800','year 1700','year 2300', 'year 1900', 'year 2200']
    })

re.findall(r'[^A-Za-z0-9]','c00@0^2')

['@', '^']

df30['special_char']=df30.company_code.apply(lambda x:','.join(re.findall(r'[^a-z0-9]',x)))
df30

	company_code	year	special_char
0	c0001#	year 1800	#
1	c00@0^2	year 1700	@,^
2	$c0003	year 2300	$
3	c0003	year 1900
4	&c0004	year 2200	&

Write a Pandas program to extract only punctuations from the specified column of a given DataFrame.

df31 = pd.DataFrame({
    'company_code': ['c0001.','c000,2','c0003', 'c0003#', 'c0004,'],
    'year': ['year 1800','year 1700','year 2300', 'year 1900', 'year 2200']
    })

df31['puct']=df31.company_code.apply(lambda x:re.findall(r'[^\w\s]',x))
df31

	company_code	year	puct
0	c0001.	year 1800	[.]
1	c000,2	year 1700	[,]
2	c0003	year 2300	[]
3	c0003#	year 1900	[#]
4	c0004,	year 2200	[,]

Write a Pandas program to remove repetitive characters from the specified column of a given DataFrame.

df32 = pd.DataFrame({
    'text_code': ['t0001.','t0002','t0003', 't0004'],
    'text_lang': ['She livedd a long life.', 'How oold is your father?', 'What is tthe problem?','TThhis desk is used by Tom.']
    })

def rep_char(str1):
    tchr = str1.group(0)
    if len(tchr) > 1:
        return tchr[0:1] # can change the value here on repetition
def unique_char(rep, sent_text):
    convert = re.sub(r'(\w)\1+', rep, sent_text)
    return convert
df32['normal_text']=df32['text_lang'].apply(lambda x : unique_char(rep_char,x))
df32

	text_code	text_lang	normal_text
0	t0001.	She livedd a long life.	She lived a long life.
1	t0002	How oold is your father?	How old is your father?
2	t0003	What is tthe problem?	What is the problem?
3	t0004	TThhis desk is used by Tom.	This desk is used by Tom.

Write a Pandas program to extract numbers greater than 940 from the specified column of a given DataFrame.

df33 = pd.DataFrame({
    'company_code': ['c0001','c0002','c0003', 'c0003', 'c0004'],
    'address': ['7277 Surrey Ave.1111','920 N. Bishop Ave.','9910 Golden Star St.', '1025 Dunbar St.', '1700 West Livingston Court']
    })
df33

	company_code	address
0	c0001	7277 Surrey Ave.1111
1	c0002	920 N. Bishop Ave.
2	c0003	9910 Golden Star St.
3	c0003	1025 Dunbar St.
4	c0004	1700 West Livingston Court

df33['num_great']=df33.address.apply(lambda x:','.join(re.findall(r'([9][4-9]\d|[1-9]\d{3})',x)))
df33

	company_code	address	num_great
0	c0001	7277 Surrey Ave.1111	7277,1111
1	c0002	920 N. Bishop Ave.
2	c0003	9910 Golden Star St.	991
3	c0003	1025 Dunbar St.	1025
4	c0004	1700 West Livingston Court	1700

Write a Pandas program to extract numbers less than 100 from the specified column of a given DataFrame.

df34 = pd.DataFrame({
    'company_code': ['c0001','c0002','c0003', 'c0003', 'c0004'],
    'address': ['72 Surrey Ave.11','92 N. Bishop Ave.','9910 Golden Star St.', '102 Dunbar St.', '17 West Livingston Court']
    })

df34['num_less']=df34.address.apply(lambda x:re.findall(r'(\b\d{2}\b)',x))
df34

	company_code	address	num_less
0	c0001	72 Surrey Ave.11	[72, 11]
1	c0002	92 N. Bishop Ave.	[92]
2	c0003	9910 Golden Star St.	[]
3	c0003	102 Dunbar St.	[]
4	c0004	17 West Livingston Court	[17]

Write a Pandas program to check whether two given words present in a specified column of a given DataFrame.

df35 = pd.DataFrame({
    'company_code': ['c0001','c0002','c0003', 'c0003', 'c0004'],
    'address': ['9910 Surrey Ave.','92 N. Bishop Ave.','9910 Golden Star Ave.', '102 Dunbar St.', '17 West Livingston Court']
    })

df35['check_words']=df35.address.apply(lambda x:re.findall(r'(?=.*Ave.)(?=.*9910).*',x))
df35

	company_code	address	check_words
0	c0001	9910 Surrey Ave.	[9910 Surrey Ave.]
1	c0002	92 N. Bishop Ave.	[]
2	c0003	9910 Golden Star Ave.	[9910 Golden Star Ave.]
3	c0003	102 Dunbar St.	[]
4	c0004	17 West Livingston Court	[]

Write a Pandas program to extract valid date (format: mm-dd-yyyy) from a given column of a given DataFrame.

df36 = pd.DataFrame({
    'company_code': ['Abcd','EFGF', 'zefsalf', 'sdfslew', 'zekfsdf'],
    'date_of_sale': ['12/05/2002','16/02/1999','05/09/1998','12/02/2022','15/09/1997'],
    'sale_amount': [12348.5, 233331.2, 22.5, 2566552.0, 23.0]
})

df36['valid_dates']=df36.date_of_sale.apply(lambda x:re.findall(r'(1[0-2]|0[1-9])/(3[0-1]|[1-2][0-9]|0\d)/(\d{4})',x))
df36

	company_code	date_of_sale	sale_amount	valid_dates
0	Abcd	12/05/2002	12348.5	[(12, 05, 2002)]
1	EFGF	16/02/1999	233331.2	[]
2	zefsalf	05/09/1998	22.5	[(05, 09, 1998)]
3	sdfslew	12/02/2022	2566552.0	[(12, 02, 2022)]
4	zekfsdf	15/09/1997	23.0	[]

Write a Pandas program to extract only words from a given column of a given DataFrame.

df37 = pd.DataFrame({
    'company_code': ['Abcd','EFGF', 'zefsalf', 'sdfslew', 'zekfsdf'],
    'date_of_sale': ['12/05/2002','16/02/1999','05/09/1998','12/02/2022','15/09/1997'],
    'address': ['9910 Surrey Ave.','92 N. Bishop Ave.','9910 Golden Star Ave.', '102 Dunbar St.', '17 West Livingston Court']
})

re.findall(r'\S+[a-zA-Z]+\S+','92 N. Bishop Ave.')

['Bishop', 'Ave.']

df37['only_words']=df37.address.apply(lambda x:' '.join(re.findall(r'\b[^\d\W]+\b',x)))
df37

	company_code	date_of_sale	address	only_words
0	Abcd	12/05/2002	9910 Surrey Ave.	Surrey Ave
1	EFGF	16/02/1999	92 N. Bishop Ave.	N Bishop Ave
2	zefsalf	05/09/1998	9910 Golden Star Ave.	Golden Star Ave
3	sdfslew	12/02/2022	102 Dunbar St.	Dunbar St
4	zekfsdf	15/09/1997	17 West Livingston Court	West Livingston Court

Write a Pandas program to extract the sentences where a specific word is present in a given column of a given DataFrame.

df38 = pd.DataFrame({
    'company_code': ['Abcd','EFGF', 'zefsalf', 'sdfslew', 'zekfsdf'],
    'date_of_sale': ['12/05/2002','16/02/1999','05/09/1998','12/02/2022','15/09/1997'],
    'address': ['9910 Surrey Avenue','92 N. Bishop Avenue','9910 Golden Star Avenue', '102 Dunbar St.', '17 West Livingston Court']
})

word='Avenue'
df38['Ext_Sent']=df38.address.apply(lambda x:re.findall(r'([^.]*'+word+'[^.]*)',x))
df38

	company_code	date_of_sale	address	Ext_Sent
0	Abcd	12/05/2002	9910 Surrey Avenue	[9910 Surrey Avenue]
1	EFGF	16/02/1999	92 N. Bishop Avenue	[ Bishop Avenue]
2	zefsalf	05/09/1998	9910 Golden Star Avenue	[9910 Golden Star Avenue]
3	sdfslew	12/02/2022	102 Dunbar St.	[]
4	zekfsdf	15/09/1997	17 West Livingston Court	[]

Write a Pandas program to extract the unique sentences from a given column of a given DataFrame.

df39 = pd.DataFrame({
    'company_code': ['Abcd','EFGF', 'zefsalf', 'sdfslew', 'zekfsdf'],
    'date_of_sale': ['12/05/2002','16/02/1999','05/09/1998','12/02/2022','15/09/1997'],
    'address': ['9910 Surrey Avenue\n9910 Surrey Avenue','92 N. Bishop Avenue','9910 Golden Star Avenue', '102 Dunbar St.\n102 Dunbar St.', '17 West Livingston Court']
})
df39

	company_code	date_of_sale	address
0	Abcd	12/05/2002	9910 Surrey Avenue\n9910 Surrey Avenue
1	EFGF	16/02/1999	92 N. Bishop Avenue
2	zefsalf	05/09/1998	9910 Golden Star Avenue
3	sdfslew	12/02/2022	102 Dunbar St.\n102 Dunbar St.
4	zekfsdf	15/09/1997	17 West Livingston Court

df39['unique_Statements']=df39.address.apply(lambda x:re.findall(r'(?sm)(^[^\r\n]+$)(?!.*^\1$)', x))
df39

	company_code	date_of_sale	address	unique_Statements
0	Abcd	12/05/2002	9910 Surrey Avenue\n9910 Surrey Avenue	[9910 Surrey Avenue]
1	EFGF	16/02/1999	92 N. Bishop Avenue	[92 N. Bishop Avenue]
2	zefsalf	05/09/1998	9910 Golden Star Avenue	[9910 Golden Star Avenue]
3	sdfslew	12/02/2022	102 Dunbar St.\n102 Dunbar St.	[102 Dunbar St.]
4	zekfsdf	15/09/1997	17 West Livingston Court	[17 West Livingston Court]

re.findall(r'(?sm)(^[^\r\n]+$)(?!.*^\1$)', '9910 Surrey Avenue\n9910 Surrey Avenue')

['9910 Surrey Avenue']

Write a Pandas program to extract words starting with capital words from a given column of a given DataFrame

df40 = pd.DataFrame({
    'company_code': ['Abcd','EFGF', 'zefsalf', 'sdfslew', 'zekfsdf'],
    'date_of_sale': ['12/05/2002','16/02/1999','05/09/1998','12/02/2022','15/09/1997'],
    'address': ['9910 Surrey Avenue','92 N. Bishop Avenue','9910 Golden Star Avenue', '102 Dunbar St.', '17 West Livingston Court']
})

df40['star_Capital']=df40.address.apply(lambda x:",".join(re.findall(r'\b[A-Z]+\S+',x)))
df40

	company_code	date_of_sale	address	star_Capital
0	Abcd	12/05/2002	9910 Surrey Avenue	Surrey,Avenue
1	EFGF	16/02/1999	92 N. Bishop Avenue	N.,Bishop,Avenue
2	zefsalf	05/09/1998	9910 Golden Star Avenue	Golden,Star,Avenue
3	sdfslew	12/02/2022	102 Dunbar St.	Dunbar,St.
4	zekfsdf	15/09/1997	17 West Livingston Court	West,Livingston,Court

Write a Pandas program to remove the html tags within the specified column of a given DataFrame.

df41 = pd.DataFrame({
    'company_code': ['Abcd','EFGF', 'zefsalf', 'sdfslew', 'zekfsdf'],
    'date_of_sale': ['12/05/2002','16/02/1999','05/09/1998','12/02/2022','15/09/1997'],
    'address': ['9910 Surrey <b>Avenue</b>','92 N. Bishop Avenue','9910 <br>Golden Star Avenue', '102 Dunbar <i></i>St.', '17 West Livingston Court']
})

re.sub('<.*?>','','9910 Surrey <b>Avenue</b>')

'9910 Surrey Avenue'

df41['removed_html']=df41.address.apply(lambda x:re.sub(r'<.*?>','',x))
df41

	company_code	date_of_sale	address	removed_html
0	Abcd	12/05/2002	9910 Surrey <b>Avenue</b>	9910 Surrey Avenue
1	EFGF	16/02/1999	92 N. Bishop Avenue	92 N. Bishop Avenue
2	zefsalf	05/09/1998	9910 <br>Golden Star Avenue	9910 Golden Star Avenue
3	sdfslew	12/02/2022	102 Dunbar <i></i>St.	102 Dunbar St.
4	zekfsdf	15/09/1997	17 West Livingston Court	17 West Livingston Court

Share on

Twitter Facebook Google+ LinkedIn

Rohit Kumar

Python Pandas - String and Regular Expression

Exercise - Pandas String and Regular Expression

Share on

You May Also Enjoy

Python Pandas - Joining and merging DataFrame

Python Pandas - DataFrame

Feature Engineering 7 - Outliers and Its impact on ML usecases

Feature Engineering 6 - Handling Imbalanced Dataset