Optimizing large Excel files with Pandas & Dask
Excel is still a “go-to” choice for tabular calculations on small, medium, and large datasets. However, it becomes tedious once a dataset grows too large, say a million rows and hundreds of columns: Excel slows down, hangs, or crashes. Handling such humongous datasets is not easy for Excel alone, and that’s where Python comes into the picture. Python helps users overcome Excel’s shortcomings and work with far more data than Excel can handle on its own.
While working with a client, we needed to optimize their Excel file operations to accommodate large amounts of data. Their dashboard worked well, but end-users experienced delays and lag due to its sheer size: uploading a file or doing a simple task like searching for a keyword could take one to two minutes. Our solution was to optimize their file handling with Python. There are many Python libraries for working with Excel, for example openpyxl, XlsxWriter, pyxlsb, and xlrd, to name a few. We chose Pandas (the Python Data Analysis Library) and Dask because they were the most sustainable fit in this case.
Here is what I plan to cover in this article:
- File read operations using Pandas & Dask
- Keyword searching operations using Pandas & Dask
- My observations about Pandas & Dask
Let’s discuss them in detail.
File read operations
In this case, the file has to be read before its data can be processed, so optimizing file read operations saves a vast amount of time. Users can read a file with Python’s built-in functions and then pass the data to libraries such as Pandas, Dask, or openpyxl, but that takes far longer than using those libraries’ own read functions.
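As a minimal illustration of the difference (reusing the testfile22.csv file from the demo code later in this article):
import csv
import pandas as pd

# Slower: read the file manually with the built-in csv module, then hand the rows to Pandas
with open('testfile22.csv', newline='') as f:
    rows = list(csv.reader(f))
df_slow = pd.DataFrame(rows[1:], columns=rows[0])

# Faster: let Pandas' optimized C parser read the file directly
df_fast = pd.read_csv('testfile22.csv')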
The screenshots below show the time taken to read Excel/CSV files using the Pandas and Dask libraries.
Fig: Time taken by Pandas to read Excel and CSV files
Fig: Time taken by Dask to read Excel and CSV files
Both libraries take a similar amount of time to read Excel files; in fact, Dask has no Excel reader of its own, so we have to go through a Pandas DataFrame to read Excel files in Dask. For CSV files, however, Dask is much faster than Pandas.
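A minimal sketch of how such timings can be measured (the testfile22.xlsx name is hypothetical; testfile22.csv is the file used in the demo code below):
import time
import pandas as pd
import dask.dataframe as dd
from dask.delayed import delayed

def timed(label, fn):
    # Run fn() once and print how long it took
    start = time.time()
    fn()
    print(f"{label}: {time.time() - start:.2f}s")

timed("Pandas CSV", lambda: pd.read_csv('testfile22.csv'))
timed("Pandas Excel", lambda: pd.read_excel('testfile22.xlsx'))
# Dask reads CSVs natively in parallel chunks; len() forces the lazy read to run
timed("Dask CSV", lambda: len(dd.read_csv('testfile22.csv')))
# Dask has no Excel reader, so wrap the Pandas reader with dask.delayed
timed("Dask Excel", lambda: len(dd.from_delayed([delayed(pd.read_excel)('testfile22.xlsx')])))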
Reading a CSV file takes less time than reading an XLS/XLSX file, and in our case users could save 10–15 seconds per read without affecting or modifying the data types. One way to exploit this is to create a background job that converts an Excel file to CSV as soon as the user uploads it, as sketched below.
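A minimal sketch of that conversion step, assuming the upload handler hands us the file paths (the names here are illustrative):
import pandas as pd

def convert_upload_to_csv(xlsx_path, csv_path):
    # Pay the slow Excel read cost once, in the background...
    df = pd.read_excel(xlsx_path)  # needs openpyxl installed for .xlsx files
    # ...so every later read hits the much faster CSV path
    df.to_csv(csv_path, index=False)

convert_upload_to_csv('upload.xlsx', 'upload.csv')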
Keyword searching
Let’s discuss the demo code used for testing; a Pandas version and a Dask version follow. In both, the file is read into a global variable so that the resulting objects can be shared between the worker processes.
Pandas:
import pandas as pd
from pandasql import sqldf
import multiprocessing
import time
print("starting...")
fstart = time.time()
df = pd.read_csv('testfile22.csv')
print(f"File loading time: {str(time.time()- fstart)}")
columns = df.columns
labels = ['row_num', 'Date', 'Headline', 'URL', 'Hit Sentence', 'Influencer']

# Normalize the search column: cast to string first, then lowercase it
df['Hit Sentence'] = df['Hit Sentence'].astype(str).str.lower()
df['row_num'] = range(1, len(df) + 1)  # 1-based row number to reference results
df = df.astype(str)

def find_word(word, selected_col='Hit Sentence'):
    # Match the keyword as a whole word (surrounded by spaces) and count its hits
    fwdf = sqldf(f'''SELECT row_num, Date, Headline, URL, Influencer, "{selected_col}"
                     FROM df WHERE "{selected_col}" LIKE "% {word} %"''')
    fwdf['word_count'] = fwdf[selected_col].str.count(word)
    return fwdf
if __name__ == '__main__':
    search_words = ['career courses', 'continue education', 'continued education', 'continues education', 'course',
                    'courses', 'coursework', 'educational program', 'Google scholarship', 'grant', 'grants',
                    'hybrid learning', 'in-house education', 'in-person programming', 'job certification',
                    'job certification program', 'job certification programs', 'job program', 'job programs',
                    'lifelong education', 'lifelong learning', 'ongoing education', 'online program', 'online seminar',
                    'online teaching', 'orientation', 'orientations', 'Pell grant', 'Pell grants', 'scholarship',
                    'self-help', 'self-help tools', 'skills course', 'work grant', 'work grants', 'advice',
                    'apprentice', 'apprenticeship', 'apprenticeships ', 'apprenticeship program', 'coach', 'coached',
                    'coaches', 'coaching', 'counsel ', 'counseling', 'remote coaching', 'feedback', 'gain experience',
                    'guide', 'guidance', 'guiding', 'instruct', 'instruction', 'invest', 'invested', 'investing',
                    'investment', 'investments', 'invest in', 'co-invest', 'co-invested', 'co-investing', 'mentor',
                    'mentors ', 'mentoring', 'mentorship ', 'assure', 'assured', 'assurances', 'balance',
                    'balanced ', 'balancing ', 'before personal needs', 'both mom and teaching assistant',
                    'come in for only part of the week', 'come in for part of the week', 'comfortable',
                    'comfort of my home', 'complacent', 'fewer hours', 'harmony', 'harmonious', 'hobby', 'hobbies',
                    'juggle', 'juggling', 'lifestyle', 'manage', 'managing', 'more time for hobbies', 'nutrition',
                    'personal life', 'personal time', 'personal priorities', 'quality of life', 'reduce hours',
                    'reduced hours', 'reducing hours', 'shorter hours']

    # One worker process per CPU core; each worker gets its own copy of the global df
    pool = multiprocessing.Pool(processes=multiprocessing.cpu_count())
    start_cal = time.time()
    res = pool.map(find_word, search_words)
    print(f"Word calculation took: {str(time.time() - start_cal)}")
    pool.close()
    print("finished...")
# pip install "dask[datagrame]"
# pip install dask-sql
import time
import json
import dask.dataframe as dd
from distributed import Client
from dask.delayed import delayed
import pandas as pd
print("starting....")
fstart = time.time()
parts = delayed(pd.read_csv)("all.csv")
df = dd.from_delayed(parts)
print(len(df))  # len() forces the delayed read to execute
print(f"File loading time: {str(time.time()- fstart)}")
df.head() # check first rows
df = df.astype(str)
columns = df.columns
def find_word(word, selected_col='Hit Sentence'):
    print(word)
    # Match the keyword either between spaces or followed by a hyphen
    fwdf = df[df[selected_col].str.contains(f" {word} | {word}-")]
    fwdf['word_count'] = fwdf[selected_col].str.count(f" {word} ")
    fwdf['word_count2'] = fwdf[selected_col].str.count(f" {word}-")
    fwdf['word_count_total'] = fwdf['word_count'] + fwdf['word_count2']
    labels = ['Date', 'Headline', 'URL', 'Hit Sentence', 'Influencer',
              'word_count', 'word_count2', 'word_count_total']
    return word, fwdf[labels]
if __name__ == '__main__':
    client = Client()  # start a local Dask cluster (scheduler + workers)
    search_words = ["air", "stop", "good", "job", "hospital", "covid", "career courses", "continue education",
                    "continued education", "continues education", "course",
                    "courses", "coursework", "educational program", "Google scholarship", "grant", "grants",
                    "hybrid learning", "job certification",
                    "job certification program", "job certification programs", "job program", "job programs",
                    "lifelong education", "lifelong learning", "ongoing education", "online program", "online seminar",
                    "online teaching", "orientation", "orientations", "Pell grant", "Pell grants", "scholarship",
                    "skills course", "work grant", "work grants", "advice",
                    "apprentice", "apprenticeship", "apprenticeships ", "apprenticeship program", "coach", "coached",
                    "coaches", "coaching", "counsel ", "counseling", "remote coaching", "feedback", "gain experience",
                    "guide", "guidance", "guiding", "instruct", "instruction", "invest", "invested", "investing",
                    "investment", "investments", "invest in", "mentor",
                    "mentors ", "mentoring", "mentorship ", "assure", "assured", "assurances", "balance",
                    "balanced ", "balancing ", "before personal needs", "both mom and teaching assistant",
                    "come in for only part of the week", "come in for part of the week", "comfortable",
                    "comfort of my home", "complacent", "fewer hours", "harmony", "harmonious", "hobby", "hobbies",
                    "juggle", "juggling", "lifestyle", "manage", "managing", "more time for hobbies", "nutrition",
                    "personal life", "personal time", "personal priorities", "quality of life", "reduce hours",
                    "reduced hours", "reducing hours", "shorter hours",
                    # "co-invest", "co-invested", "co-investing", "self-help", "self-help tools", "in-house education", "in-person programming"
                    ]

    # Override the hard-coded list with keywords loaded from a JSON file
    file_data = json.loads(open('allwords.json', 'r').read())
    search_words = file_data['keywords']
    selected_col = 'Hit Sentence'
    print(f"search keywords len: {str(len(search_words))}")

    cstart = time.time()
    # Submit one find_word task per keyword; returns a list of futures
    x = client.map(find_word, search_words)

    res_dict = {}
    for val in x:
        w, rdf = val.result()  # block until this keyword's task finishes
        res_dict[w] = rdf
    print(f"Word calculation time: {str(time.time() - cstart)}")
    print(len(res_dict))
    print("end....")
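Unlike pool.map in the Pandas version, client.map returns immediately with one future per keyword; the loop then calls .result() on each future, blocking until that task finishes, so the timing print only fires once every keyword has been processed.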
To search for keywords, the user must provide the column name they intend to search in and the number of keywords to test with, i.e., 100, 500, 1000, 2500, and so on.