{"id":3798,"date":"2023-09-15T13:51:09","date_gmt":"2023-09-15T18:51:09","guid":{"rendered":"https:\/\/fgiasson.com\/blog\/?p=3798"},"modified":"2023-09-18T08:16:20","modified_gmt":"2023-09-18T13:16:20","slug":"profiling-python-code-in-jupyter-while-doing-literate-programming-with-nbdev","status":"publish","type":"post","link":"https:\/\/fgiasson.com\/blog\/index.php\/2023\/09\/15\/profiling-python-code-in-jupyter-while-doing-literate-programming-with-nbdev\/","title":{"rendered":"Profiling Python Code in Jupyter while doing Literate Programming with nbdev"},"content":{"rendered":"\n<p>As you may know if you have followed this blog over the last few weeks, I have started experimenting with <a href=\"https:\/\/fgiasson.com\/blog\/index.php\/2023\/08\/28\/what-is-literate-programming-why\/\">literate programming<\/a> in <a href=\"https:\/\/fgiasson.com\/blog\/index.php\/2023\/08\/30\/literate-programming-in-python-using-nbdev\/\">Python using nbdev<\/a>. This means that most of the Python code I write today starts out in a Jupyter Notebook (in VSCode) and eventually makes its way into a <code>.py<\/code> module file.<\/p>\n<p>I often like to profile a function here and there to better understand where execution time is spent. I do this as part of my normal development process, not as premature optimization, but simply to better understand how things work at that point in time.<\/p>\n<p>This week I wanted to find the easiest way to quickly profile a function written in a Jupyter Notebook, without having to tangle the code blocks and work at the level of the&nbsp;<code>.py<\/code> module.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Line Profiler<\/h2>\n\n\n\n<p>The solution that worked best with my current workflow is the <a href=\"https:\/\/github.com\/pyutils\/line_profiler\">line_profiler<\/a> Python library. 
I won&#8217;t go into detail about how it works internally; I will just show an example of how it can be used and present the results.<\/p>\n<p>Let&#8217;s start with the code. Here is a piece of code I am currently working on for a small side experiment, which I will most likely release next week.<\/p>\n<p>This code reads an RSS or Atom feed from the local file system, parses it, and returns a&nbsp;<code>feed<\/code>&nbsp;<code>namedtuple<\/code> and a list of&nbsp;<code>articles<\/code>&nbsp;<code>namedtuple<\/code>s. Those tuples will later be inserted into a SQLite database using&nbsp;<code>executemany()<\/code>.<\/p>\n<p>Each of the following blocks is an individual code cell in the notebook; I omitted the explanatory text that sits between them.<\/p>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism line-numbers lang-python\" data-lang=\"Python\"><code>from line_profiler import profile\n\n@profile\ndef detect_language(text: str):\n    &quot;&quot;&quot;Detect the language of a given text&quot;&quot;&quot;\n\n    # remove all HTML tags from text\n    text = re.sub(&#39;&lt;[^&lt;]+?&gt;&#39;, &#39;&#39;, text)\n\n    # remove all HTML entities from text\n    text = re.sub(&#39;&[^;]+;&#39;, &#39;&#39;, text)\n\n    # remove all extra spaces\n    text = &#39; &#39;.join(text.split())\n\n    # return if the text is too short\n    if len(text) &lt; 128:\n        return &#39;&#39;\n\n    # limit the text to 4096 characters to speed up the \n    # language detection processing\n    text = text[:4096]\n\n    try:\n        lang = detect(text)\n    except:\n        # if langdetect returns an errors because it can&#39;t read the charset, \n        # simply return an empty string to indicate that we can&#39;t detect\n        # the language\n        return &#39;&#39;\n\n    return lang<\/code><\/pre><\/div>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism line-numbers lang-python\" 
data-lang=\"Python\"><code>Feed = namedtuple(&#39;Feed&#39;, [&#39;id&#39;, &#39;url&#39;, &#39;title&#39;, &#39;description&#39;, &#39;lang&#39;, &#39;feed_type&#39;])\nArticle = namedtuple(&#39;Article&#39;, [&#39;feed&#39;, &#39;url&#39;, &#39;title&#39;, &#39;content&#39;, &#39;creation_date&#39;, &#39;lang&#39;])<\/code><\/pre><\/div>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism line-numbers lang-python\" data-lang=\"Python\"><code>def parse_feed(feed_path: str, feed_id: str):\n    parsed = feedparser.parse(feed_path)\n\n    feed_title = parsed.feed.get(&#39;title&#39;, &#39;&#39;)\n    feed_description = parsed.feed.get(&#39;description&#39;, &#39;&#39;)\n\n    feed = Feed(feed_id,\n                parsed.feed.get(&#39;link&#39;, &#39;&#39;),\n                feed_title, \n                feed_description,\n                detect_language(feed_title + feed_description),\n                parsed.get(&#39;version&#39;, &#39;&#39;))\n\n    articles = []\n    for entry in parsed.entries:\n        article_title = entry.get(&#39;title&#39;, &#39;&#39;)\n        article_content = entry.description if &#39;description&#39; in entry else entry.content if &#39;content&#39; in entry else &#39;&#39;\n        # pass the values in the Article field order: feed, url, title, ...\n        articles.append(Article(feed_id,\n                                entry.get(&#39;link&#39;, &#39;&#39;),\n                                article_title,\n                                article_content,\n                                entry.published if &#39;published&#39; in entry else datetime.datetime.now(),\n                                detect_language(article_title + article_content)))\n    return feed, articles<\/code><\/pre><\/div>\n\n\n\n<p>Let&#8217;s say that we want to profile the&nbsp;<code>detect_language()<\/code> function when calling the&nbsp;<code>parse_feed()<\/code> function. 
To do this, the first thing we did was to decorate the&nbsp;<code>detect_language()<\/code> function with the&nbsp;<code>@profile<\/code> decorator, imported with&nbsp;<code>from line_profiler import profile<\/code>. Once this is done, we have to load the&nbsp;<code>line_profiler<\/code> extension using the&nbsp;<code>%load_ext<\/code> magic command in Jupyter. We simply create the following code cell and execute it to load the extension into the running kernel:<\/p>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism line-numbers lang-python\" data-lang=\"Python\"><code>%load_ext line_profiler<\/code><\/pre><\/div>\n\n\n\n<p>Once it is loaded, we can create another code cell that runs the <code>%lprun<\/code> magic command, which the extension adds to Jupyter. The <code>-f<\/code> option names the function to profile line by line, and the statement that follows is the code to execute:<\/p>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism line-numbers lang-python\" data-lang=\"Python\"><code>%lprun -f detect_language parse_feed(&#39;\/Users\/frederickgiasson\/.swfp\/feeds\/https---fgiasson-com-blog-index-php-feed-\/13092023\/feed.xml&#39;, &#39;https---fgiasson-com-blog-index-php-feed-&#39;)<\/code><\/pre><\/div>\n\n\n\n<p>When this cell is executed,&nbsp;<code>line_profiler<\/code> runs the statement and profiles the&nbsp;<code>detect_language()<\/code> function. 
Once it has finished, the following output appears in the notebook:<\/p>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism line-numbers lang-plain\"><code>Timer unit: 1e-09 s\n\nTotal time: 0.215358 s\nFile: \/var\/folders\/pz\/ntz31j490w950b6gn2g0j3nc0000gn\/T\/ipykernel_65374\/1039422716.py\nFunction: detect_language at line 3\n\nLine #      Hits         Time  Per Hit   % Time  Line Contents\n==============================================================\n     3                                           @profile\n     4                                           def detect_language(text: str):\n     5                                               &quot;&quot;&quot;Detect the language of a given text&quot;&quot;&quot;\n     6                                           \n     7                                               # remove all HTML tags from text\n     8        11     136000.0  12363.6      0.1      text = re.sub(&#39;&lt;[^&lt;]+?&gt;&#39;, &#39;&#39;, text)\n     9                                           \n    10                                               # remove all HTML entities from text\n    11        11      78000.0   7090.9      0.0      text = re.sub(&#39;&[^;]+;&#39;, &#39;&#39;, text)\n    12                                           \n    13                                               # remove all extra spaces\n    14        11     118000.0  10727.3      0.1      text = &#39; &#39;.join(text.split())\n    15                                           \n    16                                               # return if the text is too short\n    17        11      15000.0   1363.6      0.0      if len(text) &lt; 128:\n    18         1          0.0      0.0      0.0          return &#39;&#39;\n    19                                           \n    20                                               # limit the text to 4096 characters to speed up the \n    21                                               # language detection processing\n    22        
10      12000.0   1200.0      0.0      text = text[:4096]\n    23                                           \n    24        10       6000.0    600.0      0.0      try:\n    25        10  214980000.0    2e+07     99.8          lang = detect(text)\n    26                                               except:\n    27                                                   # if langdetect returns an errors because it can&#39;t read the charset, \n    28                                                   # simply return an empty string to indicate that we can&#39;t detect\n    29                                                   # the language\n    30                                                   return &#39;&#39;\n    31                                           \n    32        10      13000.0   1300.0      0.0      return lang<\/code><\/pre><\/div>\n\n\n\n<p>As we can see, 99.8% of the time is spent in the&nbsp;<code>detect()<\/code> call, that is, detecting the language with&nbsp;<code>langdetect<\/code>. The regular expressions and the whitespace clean-up are negligible in comparison.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Conclusion<\/h2>\n\n\n\n<p>It is as simple as that, thanks to\u00a0<code>line_profiler<\/code>, which is simple, effective and well integrated with Jupyter. This is perfect for quickly profiling some code on the fly.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>As you may know if you have followed this blog over the last few weeks, I have started experimenting with literate programming in Python using nbdev. This means that most of the Python code I write today starts out in a Jupyter Notebook (in VSCode) and eventually makes its way into a .py module file. 
[&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[277],"tags":[318,274,314,313],"class_list":["post-3798","post","type-post","status-publish","format-standard","hentry","category-literate-programming","tag-jupyter","tag-literateprogramming","tag-nbdev","tag-python"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/fgiasson.com\/blog\/index.php\/wp-json\/wp\/v2\/posts\/3798","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/fgiasson.com\/blog\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/fgiasson.com\/blog\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/fgiasson.com\/blog\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/fgiasson.com\/blog\/index.php\/wp-json\/wp\/v2\/comments?post=3798"}],"version-history":[{"count":7,"href":"https:\/\/fgiasson.com\/blog\/index.php\/wp-json\/wp\/v2\/posts\/3798\/revisions"}],"predecessor-version":[{"id":3805,"href":"https:\/\/fgiasson.com\/blog\/index.php\/wp-json\/wp\/v2\/posts\/3798\/revisions\/3805"}],"wp:attachment":[{"href":"https:\/\/fgiasson.com\/blog\/index.php\/wp-json\/wp\/v2\/media?parent=3798"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/fgiasson.com\/blog\/index.php\/wp-json\/wp\/v2\/categories?post=3798"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/fgiasson.com\/blog\/index.php\/wp-json\/wp\/v2\/tags?post=3798"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}