As you may know if you have followed this blog over the last few weeks, I have started experimenting with literate programming in Python using nbdev. This means that most of the Python code I write today is first written in a Jupyter Notebook (in VSCode) and eventually finds its way into a
.py module file.
Oftentimes, I like to profile a function here and there to better understand where execution time is spent. I do this as part of my normal development process, not as premature optimization, but just to better understand how things work at that point.
This week I wanted to understand what would be the easiest way to quickly profile a function written in a Jupyter Notebook, without having to tangle the code blocks and work at the level of the
The solution that worked best for me with my current workflow is to use the line_profiler Python library. I won’t go into detail about how it works internally; I will just show an example of how it can be used and walk through the results.
Let’s start with the code. Here is a piece of code that I am currently working on, that I will release most likely next week, which is related to a small experiment that I am doing on the side.
What this code does is read an RSS or Atom feed from the local file system, parse it, and return a
namedtuple and a list of
namedtuple. Subsequently, those will be used down the road to easily get into a SQLite database using
Each of these blocks is an individual code block within the notebook, with explanatory text in between, which I omitted here.
from line_profiler import profile

@profile
def detect_language(text: str):
    """Detect the language of a given text"""

    # remove all HTML tags from text
    text = re.sub('<[^<]+?>', '', text)

    # remove all HTML entities from text
    text = re.sub('&[^;]+;', '', text)

    # remove all extra spaces
    text = ' '.join(text.split())

    # return if the text is too short
    if len(text) < 128:
        return ''

    # limit the text to 4096 characters to speed up the
    # language detection processing
    text = text[:4096]

    try:
        lang = detect(text)
    except:
        # if langdetect returns an errors because it can't read the charset,
        # simply return an empty string to indicate that we can't detect
        # the language
        return ''

    return lang
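To make the clean-up steps at the top of detect_language() more concrete, here is a minimal standalone sketch that applies the same two regular expressions and the same whitespace normalization to a made-up HTML snippet. The helper name clean_text and the sample string are assumptions made for illustration, and the langdetect call is left out so that the sketch only needs the standard library.

```python
import re


def clean_text(text: str) -> str:
    """Apply the same clean-up steps as detect_language(), without the detection."""
    # remove all HTML tags from text
    text = re.sub('<[^<]+?>', '', text)
    # remove all HTML entities from text
    text = re.sub('&[^;]+;', '', text)
    # collapse runs of whitespace (including newlines) into single spaces
    return ' '.join(text.split())


# made-up sample, only for illustration
sample = '<p>Profiling &amp; <b>notebooks</b>:\n   a   quick test.</p>'
cleaned = clean_text(sample)

print(cleaned)             # Profiling notebooks: a quick test.
print(len(cleaned) < 128)  # True
```

Since the cleaned sample is shorter than 128 characters, detect_language() would return an empty string for it without ever calling the language detector.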
from collections import namedtuple

Feed = namedtuple('Feed', ['id', 'url', 'title', 'description', 'lang', 'feed_type'])
Article = namedtuple('Article', ['feed', 'url', 'title', 'content', 'creation_date', 'lang'])
import datetime

import feedparser

def parse_feed(feed_path: str, feed_id: str):
    parsed = feedparser.parse(feed_path)

    feed_title = parsed.feed.get('title', '')
    feed_description = parsed.feed.get('description', '')

    feed = Feed(feed_id,
                parsed.feed.get('link', ''),
                feed_title,
                feed_description,
                detect_language(feed_title + feed_description),
                parsed.get('version', ''))

    articles = []
    for entry in parsed.entries:
        article_title = entry.get('title', '')
        article_content = entry.description if 'description' in entry else entry.content if 'content' in entry else ''
        # arguments follow the order of the Article fields: feed first, then url
        articles.append(Article(feed_id,
                                entry.get('link', ''),
                                article_title,
                                article_content,
                                entry.published if 'published' in entry else datetime.datetime.now(),
                                detect_language(article_title + article_content)))

    return feed, articles
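As a side note on why namedtuples are convenient for the SQLite step mentioned earlier: a namedtuple is a plain tuple, so it can be passed directly as the parameters of an INSERT statement. The sketch below uses Python's built-in sqlite3 module, which is an assumption on my part and not necessarily the tool used in this project. The table name, the schema, and the sample rows are all hypothetical.

```python
import sqlite3
from collections import namedtuple

Article = namedtuple('Article', ['feed', 'url', 'title', 'content', 'creation_date', 'lang'])

# made-up rows standing in for the list returned by parse_feed()
articles = [
    Article('feed-1', 'https://example.com/a', 'First', 'Some content', '2023-09-13', 'en'),
    Article('feed-1', 'https://example.com/b', 'Second', 'More content', '2023-09-14', 'en'),
]

conn = sqlite3.connect(':memory:')

# hypothetical schema: one untyped column per namedtuple field
columns = ', '.join(Article._fields)
conn.execute(f'CREATE TABLE articles ({columns})')

# one "?" placeholder per field; each namedtuple is used as-is as a parameter tuple
placeholders = ', '.join('?' for _ in Article._fields)
with conn:
    conn.executemany(f'INSERT INTO articles VALUES ({placeholders})', articles)

rows = conn.execute('SELECT url, lang FROM articles ORDER BY url').fetchall()
print(rows)
```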
Let’s say that we want to profile the detect_language() function when calling the parse_feed() function. To do this, the first thing we did was to decorate the detect_language() function with the @profile decorator, imported with from line_profiler import profile. Once this is done, we have to load the line_profiler external library using the %load_ext magic command in Jupyter. To do this, we simply create a Python code block containing %load_ext line_profiler and execute the cell to load the module in the current running environment.
Once it is loaded, we can create another Python code block that will execute the
%lprun command which is specific to Jupyter:
%lprun -f detect_language parse_feed('/Users/frederickgiasson/.swfp/feeds/https---fgiasson-com-blog-index-php-feed-/13092023/feed.xml', 'https---fgiasson-com-blog-index-php-feed-')
Once this cell is executed, line_profiler runs and profiles the detect_language() function. Once it has finished, the following output appears in the notebook:
Timer unit: 1e-09 s

Total time: 0.215358 s
File: /var/folders/pz/ntz31j490w950b6gn2g0j3nc0000gn/T/ipykernel_65374/1039422716.py
Function: detect_language at line 3

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     3                                           @profile
     4                                           def detect_language(text: str):
     5                                               """Detect the language of a given text"""
     6
     7                                               # remove all HTML tags from text
     8        11     136000.0  12363.6      0.1      text = re.sub('<[^<]+?>', '', text)
     9
    10                                               # remove all HTML entities from text
    11        11      78000.0   7090.9      0.0      text = re.sub('&[^;]+;', '', text)
    12
    13                                               # remove all extra spaces
    14        11     118000.0  10727.3      0.1      text = ' '.join(text.split())
    15
    16                                               # return if the text is too short
    17        11      15000.0   1363.6      0.0      if len(text) < 128:
    18         1          0.0      0.0      0.0          return ''
    19
    20                                               # limit the text to 4096 characters to speed up the
    21                                               # language detection processing
    22        10      12000.0   1200.0      0.0      text = text[:4096]
    23
    24        10       6000.0    600.0      0.0      try:
    25        10  214980000.0    2e+07     99.8          lang = detect(text)
    26                                                   except:
    27                                                       # if langdetect returns an errors because it can't read the charset,
    28                                                       # simply return an empty string to indicate that we can't detect
    29                                                       # the language
    30                                                       return ''
    31
    32        10      13000.0   1300.0      0.0      return lang
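The columns of this report are tied together by simple arithmetic, which can help when reading it. The short sketch below only reuses the numbers printed in the report: the Time column is expressed in timer units (here 1e-09 s, so nanoseconds), the per-line times add up to the total time, Per Hit is Time divided by Hits, and % Time is the share of the total. The variable names are mine and chosen for illustration.

```python
# values copied from the report (Time column, in timer units of 1e-09 s)
TIMER_UNIT = 1e-09
line_times = {
    8: 136000.0, 11: 78000.0, 14: 118000.0, 17: 15000.0, 18: 0.0,
    22: 12000.0, 24: 6000.0, 25: 214980000.0, 32: 13000.0,
}
hits_line_25 = 10

# the per-line times add up to the "Total time" of the report
total_seconds = sum(line_times.values()) * TIMER_UNIT

# "Per Hit" is Time / Hits, converted here to seconds
per_hit_seconds = line_times[25] / hits_line_25 * TIMER_UNIT

# "% Time" is the share of the total spent on that line
share_line_25 = 100 * line_times[25] / sum(line_times.values())

print(f"total: {total_seconds:.6f} s")                # total: 0.215358 s
print(f"line 25: {per_hit_seconds * 1000:.1f} ms per call")  # line 25: 21.5 ms per call
print(f"line 25: {share_line_25:.1f} % of the time")  # line 25: 99.8 % of the time
```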
As we can see, almost all of the time (99.8%) is spent detecting the language with the detect(text) call on line 25.
It is as simple as that, thanks to line_profiler, which is simple, effective, and well integrated with Jupyter. This is perfect for quickly profiling some code on the fly.