As you may know if you have followed this blog over the last few weeks, I have started experimenting with literate programming in Python using nbdev. This means that most of the Python code I write today starts out in a Jupyter Notebook (in VSCode) and eventually makes its way into a .py module file.
I often like to profile a function here and there to better understand where execution time is spent. I do this as part of my normal development process, not to optimize prematurely, but simply to better understand how things work at that point.
This week I wanted to find the easiest way to quickly profile a function written in a Jupyter Notebook, without having to tangle the code blocks and work at the level of the .py module.
Line Profiler
The solution that worked best with my current workflow is the line_profiler Python library. I won’t go into detail about how it works internally; I will just show an example of how it can be used and walk through the results.
Let’s start with the code. Here is a piece of code I am currently working on, which I will most likely release next week, and which is related to a small experiment I am doing on the side.
What this code does is read an RSS or Atom feed from the local file system, parse it, and return a feed namedtuple and a list of article namedtuples. Down the road, those will be inserted into a SQLite database using executemany().
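Each of those blocks is an individual code block within the notebook, with explanatory text in between, which I omitted here. The blocks also rely on a few imports that are not shown: re and datetime from the standard library, namedtuple from collections, feedparser, and detect from langdetect.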
from line_profiler import profile

@profile
def detect_language(text: str):
    """Detect the language of a given text"""

    # remove all HTML tags from text
    text = re.sub('<[^<]+?>', '', text)

    # remove all HTML entities from text
    text = re.sub('&[^;]+;', '', text)

    # remove all extra spaces
    text = ' '.join(text.split())

    # return if the text is too short
    if len(text) < 128:
        return ''

    # limit the text to 4096 characters to speed up the
    # language detection processing
    text = text[:4096]

    try:
        lang = detect(text)
    except:
        # if langdetect returns an errors because it can't read the charset,
        # simply return an empty string to indicate that we can't detect
        # the language
        return ''

    return lang
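To make the three clean-up steps more concrete, here is a minimal sketch that applies the same two regular expressions and the same whitespace normalization to a made-up HTML snippet. The clean_text() helper and the sample string are hypothetical and only there for illustration; they are not part of the notebook.

```python
import re


def clean_text(text: str) -> str:
    """Apply the same clean-up steps as detect_language(), without the detection.

    Hypothetical helper, for illustration only.
    """
    # remove all HTML tags
    text = re.sub('<[^<]+?>', '', text)
    # remove all HTML entities
    text = re.sub('&[^;]+;', '', text)
    # collapse runs of whitespace (including newlines) into single spaces
    return ' '.join(text.split())


# made-up sample, not taken from a real feed
sample = '<p>Hello   <b>world</b> &amp; welcome\n to the feed.</p>'
cleaned = clean_text(sample)
print(cleaned)
print(len(cleaned) < 128)
```

The cleaned sample is far shorter than 128 characters, so detect_language() would return an empty string for it without ever calling detect(). One thing worth keeping in mind with this approach: the entity pattern removes everything from an ampersand up to the next semicolon, so a bare & in the text can swallow more than an entity.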
Feed = namedtuple('Feed', ['id', 'url', 'title', 'description', 'lang', 'feed_type'])
Article = namedtuple('Article', ['feed', 'url', 'title', 'content', 'creation_date', 'lang'])

def parse_feed(feed_path: str, feed_id: str):
    parsed = feedparser.parse(feed_path)

    feed_title = parsed.feed.get('title', '')
    feed_description = parsed.feed.get('description', '')

    feed = Feed(feed_id,
                parsed.feed.get('link', ''),
                feed_title,
                feed_description,
                detect_language(feed_title + feed_description),
                parsed.get('version', ''))

    articles = []
    for entry in parsed.entries:
        article_title = entry.get('title', '')
        article_content = entry.description if 'description' in entry else entry.content if 'content' in entry else ''
        # arguments follow the field order of Article: feed first, then url
        articles.append(Article(feed_id,
                                entry.get('link', ''),
                                article_title,
                                article_content,
                                entry.published if 'published' in entry else datetime.datetime.now(),
                                detect_language(article_title + article_content)))

    return feed, articles
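Because namedtuples are regular tuples, a list of them can be handed directly to executemany(), which is what makes the SQLite step mentioned earlier so convenient. Here is a minimal sketch of that idea, assuming a hypothetical articles table in an in-memory database and made-up rows; the actual schema is not shown in this post.

```python
import sqlite3
from collections import namedtuple

Article = namedtuple('Article', ['feed', 'url', 'title', 'content', 'creation_date', 'lang'])

# made-up rows standing in for what parse_feed() would return
articles = [
    Article('feed-1', 'https://example.com/a', 'First', 'Some content', '2023-09-13', 'en'),
    Article('feed-1', 'https://example.com/b', 'Second', 'More content', '2023-09-14', 'en'),
]

# hypothetical schema: one column per Article field, in the same order
conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE articles '
    '(feed TEXT, url TEXT, title TEXT, content TEXT, creation_date TEXT, lang TEXT)'
)

# each namedtuple is bound positionally to the six placeholders
conn.executemany('INSERT INTO articles VALUES (?, ?, ?, ?, ?, ?)', articles)
conn.commit()

rows = conn.execute('SELECT url, lang FROM articles ORDER BY url').fetchall()
print(rows)
```

Since the values are bound by position, the order of the namedtuple fields has to match the order of the table columns.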
Let’s say that we want to profile the detect_language() function when calling the parse_feed() function. The first step, which we already did above, is to decorate detect_language() with the @profile decorator imported with from line_profiler import profile. Once this is done, we have to load the line_profiler extension using the %load_ext magic command in Jupyter. To do this, we simply create the following code block and execute the cell to load the extension into the running environment:
%load_ext line_profiler
Once it is loaded, we can create another Python code block that will execute the %lprun command, which is specific to Jupyter:
%lprun -f detect_language parse_feed('/Users/frederickgiasson/.swfp/feeds/https---fgiasson-com-blog-index-php-feed-/13092023/feed.xml', 'https---fgiasson-com-blog-index-php-feed-')
Once this cell is executed, line_profiler runs the parse_feed() call and profiles detect_language(), the function named by the -f option, line by line. Once finished, the following output appears in the notebook:
Timer unit: 1e-09 s

Total time: 0.215358 s
File: /var/folders/pz/ntz31j490w950b6gn2g0j3nc0000gn/T/ipykernel_65374/1039422716.py
Function: detect_language at line 3

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     3                                           @profile
     4                                           def detect_language(text: str):
     5                                               """Detect the language of a given text"""
     6
     7                                               # remove all HTML tags from text
     8        11     136000.0  12363.6      0.1      text = re.sub('<[^<]+?>', '', text)
     9
    10                                               # remove all HTML entities from text
    11        11      78000.0   7090.9      0.0      text = re.sub('&[^;]+;', '', text)
    12
    13                                               # remove all extra spaces
    14        11     118000.0  10727.3      0.1      text = ' '.join(text.split())
    15
    16                                               # return if the text is too short
    17        11      15000.0   1363.6      0.0      if len(text) < 128:
    18         1          0.0      0.0      0.0          return ''
    19
    20                                               # limit the text to 4096 characters to speed up the
    21                                               # language detection processing
    22        10      12000.0   1200.0      0.0      text = text[:4096]
    23
    24        10       6000.0    600.0      0.0      try:
    25        10  214980000.0    2e+07     99.8          lang = detect(text)
    26                                                   except:
    27                                                       # if langdetect returns an errors because it can't read the charset,
    28                                                       # simply return an empty string to indicate that we can't detect
    29                                                       # the language
    30                                                       return ''
    31
    32        10      13000.0   1300.0      0.0      return lang
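As we can see, almost all of the time (99.8%) is spent on a single line: the call to detect() from langdetect. The regular expressions and the other clean-up steps are negligible in comparison.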
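The Per Hit and % Time columns are derived from the other two: Time is expressed in timer units (1e-09 s here, so nanoseconds), Per Hit is Time divided by Hits, and % Time is the line's share of the total. A small sketch that redoes this arithmetic with the numbers reported above (the variable names are only illustrative):

```python
TIMER_UNIT = 1e-09  # seconds per timer unit, from the "Timer unit" line

# (line number, hits, time in timer units), copied from the profiler output
rows = [
    (8, 11, 136000.0),
    (11, 11, 78000.0),
    (14, 11, 118000.0),
    (17, 11, 15000.0),
    (18, 1, 0.0),
    (22, 10, 12000.0),
    (24, 10, 6000.0),
    (25, 10, 214980000.0),
    (32, 10, 13000.0),
]

# the total is the sum of the per-line times, converted to seconds
total_units = sum(time for _, _, time in rows)
total_seconds = total_units * TIMER_UNIT
print(f"Total time: {total_seconds:.6f} s")

per_hit = {}
pct_time = {}
for line_no, hits, time in rows:
    per_hit[line_no] = time / hits               # average cost of one execution
    pct_time[line_no] = 100 * time / total_units  # share of the total time
    print(f"{line_no:>4} {per_hit[line_no]:>12.1f} {pct_time[line_no]:>6.1f}")
```

For line 25, this gives roughly 21.5 milliseconds per call to detect(), which the profiler abbreviates as 2e+07 timer units.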
Conclusion
It is as simple as that, thanks to line_profiler, which is simple, effective and well integrated with Jupyter. This is perfect for quickly profiling some code on the fly.