Hidden gems: 14 Python libraries too good to overlook

Parsing, image processing, web crawling, GUI creation -- these little-known Python libraries have you covered



Want a good reason for Python's smashing success as a language? Look no further than its massive collection of libraries, both native and third party. With so many libraries out there, though, it's no surprise some get crowded out and don't quite grab the attention they deserve. Plus, programmers who work exclusively in one domain don't always know about the goodies that may be available to them through libraries created for other kinds of work.

Here are 14 Python libraries you may have overlooked but are definitely worth your attention. It's time to give one of these hidden gems some love.




Pillow

What it's for: Image processing without the pain.

Why it's great: Most Pythonistas who have performed image processing ought to be familiar with PIL (Python Imaging Library), but PIL is riddled with shortcomings and limitations, and is updated infrequently. Pillow, a fork of PIL, aims to be both easier to use than PIL and code-compatible with it via minimal changes. Extensions are included for talking to both native Windows imaging functions and Python's Tcl/Tk-backed Tkinter GUI package.

Version 4 of Pillow, released at the beginning of 2017, brings a bevy of changes, most of them internal, and updates Pillow to use the latest versions of dependent libraries like FreeType and OpenJPEG. Pillow is available through GitHub or the PyPI repository.
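To give a feel for the API, here is a minimal sketch that assumes Pillow is installed (typically via pip install Pillow). The image is generated in memory, so the sizes and colors are placeholders rather than anything from a real project.

```python
from io import BytesIO

from PIL import Image  # Pillow keeps PIL's import name for compatibility

# Create a placeholder 640x480 image instead of loading a real file.
img = Image.new("RGB", (640, 480), color=(30, 144, 255))

# thumbnail() resizes in place and preserves the aspect ratio.
thumb = img.copy()
thumb.thumbnail((128, 128))

# Convert to 8-bit grayscale and encode as PNG into an in-memory buffer.
gray = thumb.convert("L")
buffer = BytesIO()
gray.save(buffer, format="PNG")

print(thumb.size, gray.mode, len(buffer.getvalue()), "bytes")
```

Swapping Image.new() for Image.open("some_file.jpg") is usually all it takes to run the same steps on a real image.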



Gooey

What it's for: Turn a console-based Python program into one that sports a platform-native GUI.

Why it's great: Presenting people, especially rank-and-file users, with a command-line application is among the fastest ways to reduce its use. Few beyond the hardcore like figuring out what options to pass or in what order. Gooey takes arguments expected by the argparse library and presents them to users as a GUI form, with all options labeled and displayed with appropriate controls (such as a drop-down for a multi-option argument, and so on). Very little additional coding -- a single import and a single decorator -- is needed to make it work, assuming you're already using argparse.
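The sketch below shows roughly what that looks like. The argparse program is hypothetical (the image-resizing options are made up for illustration), and the two Gooey lines are left commented out so the script still runs as a plain console program where Gooey and its wxPython dependency are not installed.

```python
import argparse

# With Gooey installed, uncommenting the two marked lines is meant to be
# the only change needed to get a GUI form instead of a command line.
# from gooey import Gooey   # (1) the import


# @Gooey                    # (2) the decorator
def main(argv=None):
    parser = argparse.ArgumentParser(description="Resize a batch of images")
    parser.add_argument("source", help="Folder containing the images")
    parser.add_argument("--width", type=int, default=800,
                        help="Target width in pixels")
    # An argument with choices is the kind Gooey renders as a drop-down.
    parser.add_argument("--format", choices=["png", "jpeg", "gif"],
                        default="png", help="Output format")
    parser.add_argument("--overwrite", action="store_true",
                        help="Replace existing files")
    return parser.parse_args(argv)


# Console-style invocation with explicit arguments, for demonstration only.
args = main(["photos", "--width", "640", "--format", "jpeg"])
print(args)
```

In a real Gooey program you would normally call main() with no arguments and let the generated form supply them.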



Peewee

What it's for: A tiny ORM that supports SQLite, MySQL, and PostgreSQL, with many extensions.

Why it's great: ORMs don't have the greatest reputation; some people would rather leave schema modeling on the database side and be done with it. But a well-constructed, unobtrusive ORM can be a godsend for developers who don't want to touch databases, and for those who don't want something as full-blown as SQLAlchemy, Peewee is a great fit. Peewee models are easy to construct, connect, and manipulate, and many common query-manipulation functions (such as pagination) are built right in. More features are available as add-ons, including extensions for other databases, testing tools, and -- a feature even ORM haters might learn to love -- a schema migration system.



Scrapy

What it's for: Screen scraping and web crawling.

Why it's great: Scrapy keeps the whole process of scraping simple. Create a class that defines the kind of item(s) you want scraped and write some rules about how to extract that data from the page; the results are exported as JSON, XML, CSV, or any number of other formats. The collected data can be saved raw, or it can be sanitized as it's imported. Plus, Scrapy can be extended to allow many other behaviors, such as logging into a website or handling session cookies. Images, too, can be automatically siphoned up by Scrapy and associated with the scraped content.


Apache Libcloud

What it's for: Accessing multiple cloud providers through a single, consistent, and unified API.

Why it's great: If the above description of Apache Libcloud doesn't make you clap your hands for joy, nothing will. Cloud providers all love to do things their way -- sometimes subtly, sometimes not -- so having a unified mechanism for dealing with dozens of providers and the associated methods for manipulating their resources is a boon. APIs are available for compute, storage, load balancing, and DNS, with support for both the 2.x and 3.x flavors of Python. For those using the PyPy version of Python for the additional performance, PyPy is supported as well.



Pygame

What it's for: A framework for creating video games in Python.

Why it's great: If you think no one outside of the game development world would ever bother with such a framework, think again. Pygame provides a handy option to work with many GUI-oriented behaviors that might otherwise require a lot of heavy lifting: drawing canvas and sprite graphics; dealing with multichannel sound; handling windows and click events; detecting collisions; and so on. Not every app, and not even every GUI app, will benefit from being built with Pygame, but take a closer look at what it provides and you might be surprised.
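Collision detection is a good example of something that is tedious to hand-roll. The sketch below assumes Pygame is installed and uses only its Rect class, which works without opening a window; the coordinates are arbitrary.

```python
import pygame

# Rect works without initializing a display, so this runs headless.
player = pygame.Rect(0, 0, 32, 32)    # x, y, width, height
wall = pygame.Rect(100, 0, 16, 200)

hit_before = bool(player.colliderect(wall))  # player is far from the wall

# Move the player 80 pixels to the right, in place.
player.move_ip(80, 0)
hit_after = bool(player.colliderect(wall))   # player now overlaps the wall

# collidelist() returns the index of the first overlapping rect, or -1.
coins = [pygame.Rect(90, 10, 8, 8), pygame.Rect(300, 300, 8, 8)]
first_coin = player.collidelist(coins)

print(hit_before, hit_after, first_coin)
```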



NumPy

What it's for: Scientific computing and mathematical work, including statistics, linear algebra, matrix math, financial operations, and tons more.

Why it's great: Quants and bean counters already know about NumPy and love it, but the range of applications for NumPy outside math 'n' stats is broader than you think. For example, it's one of the easiest, most flexible ways to add support for multidimensional arrays to Python, a gap that newcomers from other languages often complain about. If you want the total and complete Python science-and-math enchilada, though, get the SciPy library and environment, which includes NumPy as standard issue. For more sophisticated data analysis built on top of NumPy, check out Pandas.
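For readers who have never touched it, here is a small taste of NumPy's multidimensional arrays, assuming NumPy is installed. The numbers are arbitrary.

```python
import numpy as np

# A 3x4 two-dimensional array holding the integers 0 through 11.
grid = np.arange(12).reshape(3, 4)

# Aggregate along an axis: axis=0 sums down each column.
column_sums = grid.sum(axis=0)

# Slice rows and columns at once: the last two rows, first two columns.
corner = grid[1:, :2]

# Arithmetic applies element by element, no explicit loops needed.
doubled = grid * 2

print(column_sums.tolist())
print(corner.tolist())
```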



Sh

What it's for: Calling any external program, in a subprocess, and returning the results to a Python program -- but with the same syntax as if the program in question were a native Python function.

Why it's great: On any POSIX-compliant system, Sh is a godsend. It means the entire range of command-line programs available on those platforms can be used Pythonically. Not only do you no longer have to reinvent the wheel (why implement ping when it's right there in the OS?), but you no longer have to struggle with how to add that functionality elegantly to your application. Be warned: There's no sanitization of parameters passed through this library. Be sure never to pass along raw user input.



Python-docx

What it's for: Programmatically creating and manipulating Microsoft Word .docx files.

Why it's great: In theory, it should be easy to write scripts that create and manipulate XML-style Microsoft Word documents. In practice, it isn't that simple, no thanks to all of the internal complexities of the .docx format. Python-docx lets you do an end run around all that, by providing a high-level, programmatic way to create .docx files.

Text, images, styles, and document sections can all be added and changed via the library's APIs. The library also lets you change an existing document. It's a handy way to make changes that would otherwise require another Python library or Word's own built-in automation functions. Some features aren't supported yet. For instance, you can't add or change headers and footnotes -- but Python-docx does its best to preserve such things even if they can't be manipulated.



PyFilesystem

What it's for: A common, Pythonic interface to any filesystem -- any filesystem.

Why it's great: The fundamental idea behind PyFilesystem couldn't be simpler: "In the same way that file objects abstract a single file," says the library's docs, "FS objects abstract an entire filesystem." This doesn't mean only on-disk filesystems; it also means FTP directories, in-memory filesystems, filesystems for locations defined by the OS (such as the user directory), and even combinations of the above overlaid onto each other.

Aside from making it easier to write cross-platform code that manipulates files, PyFilesystem obviates the need to cobble together things from disparate parts of the standard library, mainly os and io. It also provides utilities that one might otherwise have to roll from scratch, like a tool for printing console-friendly tree views of a filesystem.



EbookLib

What it's for: Easy creation of e-books.

Why it's great: Creating e-books typically requires wrangling together one command-line tool or another. EbookLib provides management tools and APIs to simplify the process. It works with EPUB version 2 and 3 files, with Kindle support under development.

Provide the images and the text (the latter in HTML format), and you can assemble those pieces into an e-book complete with chapters, nested table of contents entries, images, HTML markup, and so on. Cover, spine, and stylesheet data are all supported, too. A plug-in system allows third parties to extend the library's behaviors.

If you don't need something as full-bodied as EbookLib, there's Mkepub, which provides basic e-book assembly functionality in a library that's only a few kilobytes, but includes features like the ability to add images to a document. One minor drawback of Mkepub is that it requires Jinja2, which in turn requires the MarkupSafe library.



Cython

What it's for: Accelerating Python code by compiling it to C.

Why it's great: Python is wonderfully convenient, but that convenience comes at the cost of performance. C is the gold standard for runtime performance (barring assembly), but can be unwieldy to work with. Cython taps into the best of both worlds -- not only by providing a convenient option for Python to access libraries in C, but also by allowing Python code to be transformed into high-performance C code. It's used widely in scientific computing, but it can be used for judicious speedups of many kinds of applications.

The best part about this transformation process is that you don't have to do it all at once. You can start with Python code as-is and compile it with Cython to obtain a modest performance boost. For further speedups, you can decorate variables and functions with type annotations, a process no more complicated than using Python's PEP 484 type-hinting system (although Cython's syntax is different).
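To make the two stages concrete, here is a sketch built around a made-up numeric function. The first stage is ordinary Python that runs as-is and could also be handed to Cython unchanged (for example with the cythonize command-line tool). The second stage is shown only in comments, because Cython's cdef syntax is not valid Python.

```python
# Stage 1: plain Python. This runs in the regular interpreter and can
# also be compiled by Cython without modification.
def integrate_square(a, b, n):
    """Approximate the integral of x**2 over [a, b] using n midpoint rectangles."""
    dx = (b - a) / n
    total = 0.0
    for i in range(n):
        x = a + (i + 0.5) * dx
        total += x * x
    return total * dx


# Stage 2 (illustrative only, would live in a .pyx file): the same function
# with C types declared so the loop can compile down to plain C arithmetic.
#
#   def integrate_square(double a, double b, int n):
#       cdef double dx = (b - a) / n
#       cdef double total = 0.0
#       cdef double x
#       cdef int i
#       for i in range(n):
#           x = a + (i + 0.5) * dx
#           total += x * x
#       return total * dx

result = integrate_square(0.0, 1.0, 1000)
print(result)  # close to 1/3
```

How much each stage helps depends heavily on the code; tight numeric loops like this one tend to be where type declarations pay off.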



Behold

What it's for: Robust support for print-style debugging in Python.

Why it's great: There's one simple way to debug in most any language: Insert in-line "print" statements. Python is no exception, and that's exactly how many people do ad hoc debugging, even in large projects. As easy as print-debugging is, though, it's hard to get useful results within large, sprawling, multimodule projects.

Behold provides a toolkit for contextual debugging via print statements. It allows you to impose a uniform look on the output, tag the results so that they can be sorted through by way of searches or filters, and provide contexts across modules so that functions that originate in one module can be debugged properly in another. Behold handles many common Python-specific scenarios like printing an object's internal dictionary, unveiling nested attributes, storing and reusing results for comparison at other points during the debugging process, and many more.