PreambleThis is a cross-over episode from our new show The Machine Learning Podcast, the show about going from idea to production with machine learning.
SummaryThe majority of machine learning projects that you read about or work on are built around batch processes. The model is trained, and then validated, and then deployed, with each step being a discrete and isolated task. Unfortunately, the real world is rarely static, leading to concept drift and model failures. River is a framework for building streaming machine learning projects that can constantly adapt to new information. In this episode Max Halford explains how the project works, why you might (or might not) want to consider streaming ML, and how to get started building with River.
Announcements* Hello and welcome to the Machine Learning Podcast, the podcast about machine learning and how to bring it from idea to delivery. * Building good ML models is hard, but testing them properly is even harder. At Deepchecks, they built an open-source testing framework that follows best practices, ensuring that your models behave as expected. Get started quickly using their built-in library of checks for testing and validating your model’s behavior and performance, and extend it to meet your specific needs as your model evolves. Accelerate your machine learning projects by building trust in your models and automating the testing that you used to do manually. Go to themachinelearningpodcast.com/deepchecks today to get started! * Your host is Tobias Macey and today I’m interviewing Max Halford about River, a Python toolkit for streaming and online machine learning
Interview* Introduction * How did you get involved in machine learning? * Can you describe what River is and the story behind it? * What is "online" machine learning? + What are the practical differences with batch ML? + Why is batch learning so predominant? + What are the cases where someone would want/need to use online or streaming ML? * The prevailing pattern for batch ML model lifecycles is to train, deploy, monitor, repeat. What does the ongoing maintenance for a streaming ML model look like? + Concept drift is typically due to a discrepancy between the data used to train a model and the actual data being observed. How does the use of online learning affect the incidence of drift? * Can you describe how the River framework is implemented? + How have the design and goals of the project changed since you started working on it? * How do the internal representations of the model differ from batch learning to allow for incremental updates to the model state? * In the documentation you note the use of Python dictionaries for state management and the flexibility offered by that choice. What are the benefits and potential pitfalls of that decision? * Can you describe the process of using River to design, implement, and validate a streaming ML model? + What are the operational requirements for deploying and serving the model once it has been developed? * What are some of the challenges that users of River might run into if they are coming from a batch learning background? * What are the most interesting, innovative, or unexpected ways that you have seen River used? * What are the most interesting, unexpected, or challenging lessons that you have learned while working on River? * When is River the wrong choice? * What do you have planned for the future of River?
Contact Info* Email * @halford_max on Twitter * MaxHalford on GitHub
Parting Question* From your perspective, what is the biggest barrier to adoption of machine learning today?
Closing Announcements* Thank you for listening! Don’t forget to check out our other shows. The Data Engineering Podcast covers the latest on modern data management. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. * Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. * If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@themachinelearningpodcast.com) with your story. * To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links* River * scikit-multiflow * Federated Machine Learning * Hogwild! Google Paper * Chip Huyen concept drift blog post * Dan Crenshaw Berkeley Clipper MLOps * Robustness Principle * NY Taxi Dataset * RiverTorch * River Public Roadmap * Beaver tool for deploying online models * Prodigy ML human in the loop labeling
The intro and outro music is from Hitman’s Lovesong feat. Paola Graziano by The Freak Fandango Orchestra/CC BY-SA 3.0
PreambleThis is a cross-over episode from our new show The Machine Learning Podcast, the show about going from idea to production with machine learning.
SummaryDeep learning is a revolutionary category of machine learning that accelerates our ability to build powerful inference models. Along with that power comes a great deal of complexity in determining what neural architectures are best suited to a given task, engineering features, scaling computation, etc. Predibase is building on the successes of the Ludwig framework for declarative deep learning and Horovod for horizontally distributing model training. In this episode CTO and co-founder of Predibase, Travis Addair, explains how they are reducing the burden of model development even further with their managed service for declarative and low-code ML and how they are integrating with the growing ecosystem of solutions for the full ML lifecycle.
Announcements* Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great! * When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. And now you can launch a managed MySQL, Postgres, or Mongo database cluster in minutes to keep your critical data safe with automated backups and failover. Go to pythonpodcast.com/linode and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! * Your host is Tobias Macey and today I’m interviewing Travis Addair about Predibase, a low-code platform for building ML models in a declarative format
Interview* Introduction * How did you get involved in machine learning? * Can you describe what Predibase is and the story behind it? * Who is your target audience and how does that focus influence your user experience and feature development priorities? * How would you describe the semantic differences between your chosen terminology of "declarative ML" and the "autoML" nomenclature that many projects and products have adopted? + Another platform that launched recently with a promise of "declarative ML" is Continual. How would you characterize your relative strengths? * Can you describe how the Predibase platform is implemented? + How have the design and goals of the product changed as you worked through the initial implementation and started working with early customers? + The operational aspects of the ML lifecycle are still fairly nascent. How have you thought about the boundaries for your product to avoid getting drawn into scope creep while providing a happy path to delivery? * Ludwig is a core element of your platform. What are the other capabilities that you are layering around and on top of it to build a differentiated product? * In addition to the existing interfaces for Ludwig you created a new language in the form of PQL. What was the motivation for that decision? + How did you approach the semantic and syntactic design of the dialect? + What is your vision for PQL in the space of "declarative ML" that you are working to define? * Can you describe the available workflows for an individual or team that is using Predibase for prototyping and validating an ML model? + Once a model has been deemed satisfactory, what is the path to production? * How are you approaching governance and sustainability of Ludwig and Horovod while balancing your reliance on them in Predibase? * What are some of the notable investments/improvements that you have made in Ludwig during your work of building Predibase? * What are the most interesting, innovative, or unexpected ways that you have seen Predibase used? * What are the most interesting, unexpected, or challenging lessons that you have learned while working on Predibase? * When is Predibase the wrong choice? * What do you have planned for the future of Predibase?
Contact Info* LinkedIn * tgaddair on GitHub * @travisaddair on Twitter
Parting Question* From your perspective, what is the biggest barrier to adoption of machine learning today?
Closing Announcements* Thank you for listening! Don’t forget to check out our other shows. The Data Engineering Podcast covers the latest on modern data management. The Machine Learning Podcast helps you go from idea to production with machine learning. * Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. * If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com) with your story. * To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links* Predibase * Horovod * Ludwig + Podcast.__init__ Episode * Support Vector Machine * Hadoop * Tensorflow * Uber Michaelangelo * AutoML * Spark ML Lib * Deep Learning * PyTorch * Continual + Data Engineering Podcast Episode * Overton * Kubernetes * Ray * Nvidia Triton * Whylogs + Data Engineering Podcast Episode * Weights and Biases * MLFlow * Comet * Confusion Matrices * dbt + Data Engineering Podcast Episode * Torchscript * Self-supervised Learning
The intro and outro music is from Hitman’s Lovesong feat. Paola Graziano by The Freak Fandango Orchestra/CC BY-SA 3.0
Machine learning has the potential to transform industries and revolutionize business capabilities, but only if the models are reliable and robust. Because of the fundamental probabilistic nature of machine learning techniques it can be challenging to test and validate the generated models. The team at Deepchecks understands the widespread need to easily and repeatably check and verify the outputs of machine learning models and the complexity involved in making it a reality. In this episode Shir Chorev and Philip Tannor explain how they are addressing the problem with their open source deepchecks library and how you can start using it today to build trust in your machine learning applications.
Building an ML model is getting easier than ever, but it is still a challenge to get that model in front of the people that you built it for. Baseten is a platform that helps you quickly generate a full stack application powered by your model. You can easily create a web interface and APIs powered by the model you created, or a pre-trained model from their library. In this episode Tuhin Srivastava, co-founder of Basten, explains how the platform empowers data scientists and ML engineers to get their work in production without having to negotiate for help from their application development colleagues.
Starting a new project is always exciting and full of possibility, until you have to set up all of the repetitive boilerplate. Fortunately there are useful project templates that eliminate that drudgery. PyScaffold goes above and beyond simple template repositories, and gives you a toolkit for different application types that are packed with best practices to make your life easier. In this episode Florian Wilhelm shares the story behind PyScaffold, how the templates are designed to reduce friction when getting a new project off the ground, and how you can extend it to suit your needs. Stop wasting time with boring boilerplate and get straight to the fun part with PyScaffold!
Application configuration is a deceptively complex problem. Everyone who is building a project that gets used more than once will end up needing to add configuration to control aspects of behavior or manage connections to other systems and services. At first glance it seems simple, but can quickly become unwieldy. Bruno Rocha created Dynaconf in an effort to provide a simple interface with powerful capabilities for managing settings across environments with a set of strong opinions. In this episode he shares the story behind the project, how its design allows for adapting to various applications, and how you can start using it today for your own projects.
Software is eating the world, but that code has to have hardware to execute the instructions. Most people, and many software engineers, don't have a proper understanding of how that hardware functions. Charles Petzold wrote the book "Code: The Hidden Language of Computer Hardware and Software" to make this a less opaque subject. In this episode he discusses what motivated him to revise that work in the second edition and the additional details that he packed in to explore the functioning of the CPU.
The generation, distribution, and consumption of energy is one of the most critical pieces of infrastructure for the modern world. With the rise of renewable energy there is an accompanying need for systems that can respond in real-time to the availability and demand for electricity. FlexMeasures is an open source energy management system that is designed to integrate a variety of inputs intelligently allocate energy resources to reduce waste in your home or grid. In this episode Nicolas Höning explains how the project is implemented, how it is being used in his startup Seita, and how you can try it out for your own energy needs.
Your ability to build and maintain a software project is tempered by the strength of the team that you are working with. If you are in a position of leadership, then you are responsible for the growth and maintenance of that team. In this episode Jigar Desai, currently the SVP of engineering at Sisu Data, shares his experience as an engineering leader over the past several years and the useful insights he has gained into how to build effective engineering teams.
Working on hardware projects often has significant friction involved when compared to pure software. Brian Pugh enjoys tinkering with microcontrollers, but his "weekend projects" often took longer than a weekend to complete, so he created Belay. In this episode he explains how Belay simplifies the interactions involved in developing for MicroPython boards and how you can use it to speed up your own experimentation.
Static typing versus dynamic typing is one of the oldest debates in software development. In recent years a number of dynamic languages have worked toward a middle ground by adding support for type hints. Python's type annotations have given rise to an ecosystem of tools that use that type information to validate the correctness of programs and help identify potential bugs. At Instagram they created the Pyre project with a focus on speed to allow for scaling to huge Python projects. In this episode Shannon Zhu discusses how it is implemented, how to use it in your development process, and how it compares to other type checkers in the Python ecosystem.
Every software project is subject to a series of decisions and tradeoffs. One of the first decisions to make is which programming language to use. For companies where their product is software, this is a decision that can have significant impact on their overall success. In this episode Sean Knapp discusses the languages that his team at Ascend use for building a service that powers complex and business critical data workflows. He also explains his motivation to standardize on Python for all layers of their system to improve developer productivity.
Writing code is only one piece of creating good software. Code reviews are an important step in the process of building applications that are maintainable and sustainable. In this episode On Freund shares his thoughts on the myriad purposes that code reviews serve, as well as exploring some of the patterns and anti-patterns that grow up around a seemingly simple process.
Quality assurance in the software industry has become a shared responsibility in most organizations. Given the rapid pace of development and delivery it can be challenging to ensure that your application is still working the way it's supposed to with each release. In this episode Jonathon Wright discusses the role of quality assurance in modern software teams and how automation can help.
The goal of every software team is to get their code into production without breaking anything. This requires establishing a repeatable process that doesn't introduce unnecessary roadblocks and friction. In this episode Ronak Rahman discusses the challenges that development teams encounter when trying to build and maintain velocity in their work, the role that access to infrastructure plays in that process, and how to build automation and guardrails for everyone to take part in the delivery process.
Every startup begins with an idea, but that won't get you very far without testing the feasibility of that idea. A common practice is to build a Minimum Viable Product (MVP) that addresses the problem that you are trying to solve and working with early customers as they engage with that MVP. In this episode Tony Pavlovych shares his thoughts on Python's strengths when building and launching that MVP and some of the potential pitfalls that businesses can run into on that path.
Application architectures have been in a constant state of evolution as new infrastructure capabilities are introduced. Virtualization, cloud, containers, mobile, and now web assembly have each introduced new options for how to build and deploy software. Recognizing the transformative potential of web assembly, Matt Butcher and his team at Fermyon are investing in tooling and services to improve the developer experience. In this episode he explains the opportunity that web assembly offers to all language communities, what they are building to power lightweight server-side microservices, and how Python developers can get started building and contributing to this nascent ecosystem.
As your code scales beyond a trivial level of complexity and sophistication it becomes difficult or impossible to know everything that it is doing. The flow of logic and data through your software and which parts are taking the most time are impossible to understand without help from your tools. VizTracer is the tool that you will turn to when you need to know all of the execution paths that are being exercised and which of those paths are the most expensive. In this episode Tian Gao explains why he created VizTracer and how you can use it to gain a deeper familiarity with the code that you are responsible for maintaining.
Analysis of streaming data in real time has long been the domain of big data frameworks, predominantly written in Java. In order to take advantage of those capabilities from Python requires using client libraries that suffer from impedance mis-matches that make the work harder than necessary. Bytewax is a new open source platform for writing stream processing applications in pure Python that don't have to be translated into foreign idioms. In this episode Bytewax founder Zander Matheson explains how the system works and how to get started with it today.
Building a fully functional web application has been growing in complexity along with the growing popularity of javascript UI frameworks such as React, Vue, Angular, etc. Users have grown to expect interactive experiences with dynamic page updates, which leads to duplicated business logic and complex API contracts between the server-side application and the Javascript front-end. To reduce the friction involved in writing and maintaining a full application Sam Willis created Tetra, a framework built on top of Django that embeds the Javascript logic into the Python context where it is used. In this episode he explains his design goals for the project, how it has helped him build applications more rapidly, and how you can start using it to build your own projects today.
Virtually everything that you interact with on a daily basis and many other things that make modern life possible were designed and modeled in software called CAD or Computer-Aided Design. These programs are advanced suites with graphical editing environments tailored to domain experts in areas such as mechanical engineering, electrical engineering, architecture, etc. While the UI-driven workflow is more accessible, it isn't scalable which opens the door to code-driven workflows. In this episode Jeremy Wright discusses the design, uses, and benefits of the CadQuery framework for building 3D CAD models entirely in Python.
Building any software project is going to require relying on dependencies that you and your team didn't write or maintain, and many of those will have dependencies of their own. This has led to a wide variety of potential and actual issues ranging from developer ergonomics to application security. In order to provide a higher degree of confidence in the optimal combinations of direct and transitive dependencies a team at Red Hat started Project Thoth. In this episode Fridolín Pokorný explains how the Thoth resolver uses multiple signals to find the best combination of dependency versions to ensure compatibility and avoid known security issues.
Most developers have encountered code completion systems and rely on them as part of their daily work. They allow you to stay in the flow of programming, but have you ever stopped to think about how they work? In this episode Meredydd Luff takes us behind the scenes to dig into the mechanics of code completion engines and how you can customize them to fit your particular use case.
Russell Keith-Magee is an accomplished engineer and a fixture of the Python community. His work on the Beeware suite of projects is one of the most ambitious undertakings in the ecosystem and unfailingly forward-looking. With his recent transition to working for Anaconda he is now able to dedicate his full focus to the effort. In this episode he reflects on the journey that he has taken so far, how Beeware is helping to address some of the threats to Python's long term viability, and how he envisions its future in light of the recent release of PyScript, an in-browser runtime for Python.
Digital cameras and the widespread availability of smartphones has allowed us all to generate massive libraries of personal photographs. Unfortunately, now we are all left to our own devices of how to manage them. While cloud services such as iPhotos and Google Photos are convenient, they aren't always affordable and they put your pictures under the control of large companies with their own agendas. LibrePhotos is an open source and self-hosted alternative to these services that puts you in control of your digital memories. In this episode the maintainer of LibrePhotos, Niaz Faridani-Rad, explains how he got involved with the project, the capabilities that it offers for managing your image library, and how to get your own instance set up to take back control of your pictures.
Investing effectively is largely a game of information access and analysis. This can involve a substantial amount of research and time spent on finding, validating, and acquiring different information sources. In order to reduce the barrier to entry and provide a powerful framework for amateur and professional investors alike Didier Rodrigues Lopes created the OpenBB Terminal. In this episode he explains how a pandemic project that started as an experiment has led to him founding a new company and dedicating his time to growing and improving the project and its community.
The experimentation phase of building a machine learning model requires a lot of trial and error. One of the limiting factors of how many experiments you can try is the length of time required to train the model which can be on the order of days or weeks. To reduce the time required to test different iterations Rolando Garcia Sanchez created FLOR which is a library that automatically checkpoints training epochs and instruments your code so that you can bypass early training cycles when you want to explore a different path in your algorithm. In this episode he explains how the tool works to speed up your experimentation phase and how to get started with it.
Programmers love to automate tedious processes, including refactoring your code. In order to support the creation of code modifications for your Python projects Jimmy Lai created LibCST. It provides a richly typed and high level API for creating and manipulating concrete syntax trees of your source code. In this episode Jimmy Lai and Zsolt Dollenstein explain how it works, some of the linting and automatic code modification utilities that you can build with it and how to get started with using it to maintain your own Python projects.
Communication is a fundamental requirement for any program or application. As the friction involved in deploying code has gone down, the motivation for architecting your system as microservices goes up. This shifts the communication patterns in your software from function calls to network calls. In this episode Idit Levine explains how the Gloo platform that she and her team at Solo have created makes it easier for you to configure and monitor the network topologies for your microservice environments. She also discusses what developers need to know about networking in cloud native environments and how a combination of API gateways and service mesh technologies allow you to more rapidly iterate on your systems.
Cloud native architectures have been gaining prominence for the past few years due to the rising popularity of Kubernetes. This introduces new complications to development workflows due to the need to integrate with multiple services as you build new components for your production systems. In order to reduce the friction involved in developing applications for cloud native environments Michael Schilonka created Gefyra. In this episode he explains how it connects your local machine to a running Kubernetes environment so that you can rapidly iterate on your software in the context of the whole system. He also shares how the Django Hurricane plugin lets your applications work closely with the Kubernetes process model.
Science is founded on the collection and analysis of data. For disciplines that rely on data about the earth the ability to simulate and generate that data has been growing faster than the tools for analysis of that data can keep up with. In order to help scale that capacity for everyone working in geosciences the Pangeo project compiled a reference stack that combines powerful tools into an out-of-the-box solution for researchers to be productive in short order. In this episode Ryan Abernathy and Joe Hamman explain what the Pangeo project really is, how they have integrated a combination of XArray, Dask, and Jupyter to power these analytical workflows, and how it has helped to accelerate research on multidimensional geospatial datasets.
A common piece of advice when starting anything new is to "begin with the end in mind". In order to help the engineers at Wayfair manage the complete lifecycle of their applications Joshua Woodward runs a team that provides tooling and assistance along every step of the journey. In this episode he shares some of the lessons and tactics that they have developed while assisting other engineering teams with starting, deploying, and sunsetting projects. This is an interesting look at the inner workings of large organizations and how they invest in the scaffolding that supports their myriad efforts.
Kubernetes is a framework that aims to simplify the work of running applications in production, but it forces you to adopt new patterns for debugging and resolving issues in your systems. Robusta is aimed at making that a more pleasant experience for developers and operators through pre-built automations, easy debugging, and a simple means of creating your own event-based workflows to find, fix, and alert on errors in production. In this episode Natan Yellin explains how the project got started, how it is architected and tested, and how you can start using it today to keep your Python projects running reliably.
Building a machine learning application is inherently complex. Once it becomes necessary to scale the operation or training of the model, or introduce online re-training the process becomes even more challenging. In order to reduce the operational burden of AI developers Robert Nishihara helped to create the Ray framework that handles the distributed computing aspects of machine learning operations. To support the ongoing development and simplify adoption of Ray he co-founded Anyscale. In this episode he re-joins the show to share how the project, its community, and the ecosystem around it have grown and evolved over the intervening two years. He also explains how the techniques and adoption of machine learning have influenced the direction of the project.
As software projects grow and change it can become difficult to keep track of all of the logical flows. By visualizing the interconnections of function definitions, classes, and their invocations you can speed up the time to comprehension for newcomers to a project, or help yourself remember what you worked on last month. In this episode Scott Rogowski shares his work on Code2Flow as a way to generate a call graph of your programs. He explains how it got started, how it works, and how you can start using it to understand your Python, Ruby, and PHP projects.
One of the most persistent challenges faced by organizations of all sizes is the recording and distribution of institutional knowledge. In technical teams this is exacerbated by the need to incorporate technical review feedback and manage access to data before publishing. When faced with this problem as an early data scientist at AirBnB, Chetan Sharma helped create the Knowledge Repo project as a solution. In this episode he shares the story behind its creation and growth, how and why it was released as open source, and the features that make it a compelling option for your own team's knowledge management journey.
Software development is a complex undertaking due to the number of options available and choices to be made in every stage of the lifecycle. In order to make it more scaleable it is necessary to establish common practices and patterns and introduce strong opinions. One area that can have a huge impact on the productivity of the engineers engaged with a project is the tooling used for building, validating, and deploying changes introduced to the software. In this episode maintainers of the Pants build tool Eric Arellano, Stu Hood, and Andreas Stenius discuss the recent updates that add support for more languages, efforts made to simplify its adoption, and the growth of the community that uses it. They also explore how using Pants as the single entry point for all of your routine tasks allows you to spend your time on the decisions that matter.
It doesn't matter how amazing your application is if you are unable to deliver it to your users. Frustrated with the rampant complexity involved in building and deploying software Vlad A. Ionescu created the Earthly tool to reduce the toil involved in creating repeatable software builds. In this episode he explains the complexities that are inherent to building software projects and how he designed the syntax and structure of Earthly to make it easy to adopt for developers across all language environments. By adopting Earthly you can use the same techniques for building on your laptop and in your CI/CD pipelines.
The process of getting software delivered to an environment where users can interact with it requires many steps along the way. In some cases the journey can require a large number of interdependent workflows that need to be orchestrated across technical and organizational boundaries, making it difficult to know what the current status is. Faced with such a complex delivery workflow the engineers at Ericsson created a message based protocol and accompanying tooling to let the various actors in the process provide information about the events that happened across the different stages. In this episode Daniel Ståhl and Magnus Bäck explain how the Eiffel protocol allows you to build a tooling agnostic visibility layer for your software delivery process, letting you answer all of your questions about what is happening between writing a line of code and your users executing it.
When we are creating applications we spend a significant amount of effort on optimizing the experience of our end users to ensure that they are able to complete the tasks that the system is intended for. A similar effort that we should all consider is optimizing the developer experience for ourselves and other engineers who contribute to the projects that we work on. Adam Johnson recently wrote a book on how to improve the developer experience for Django projects and in this episode he shares some of the insights that he has gained through that project and his work with clients to help you improve the experience that you and your team have when collaborating on software development.
Pandas has grown to be a ubiquitous tool for working with data at every stage. It has become so well known that many people learn Python solely for the purpose of using Pandas. With all of this activity and the long history of the project it can be easy to find misleading or outdated information about how to use it. In this episode Matt Harrison shares his work on the book "Effective Pandas" and some of the best practices and potential pitfalls that you should know for applying Pandas in your own work.
Developers hate wasting effort on manual processes when we can write code to do it instead. Cog is a tool to manage the work of automating the creation of text inside another file by executing arbitrary Python code. In this episode Ned Batchelder shares the story of why he created Cog in the first place, some of the interesting ways that he uses it in his daily work, and the unique challenges of maintaining a project with a small audience and a well defined scope.
Statistical regression models are a staple of predictive forecasts in a wide range of applications. In this episode Matthew Rudd explains the various types of regression models, when to use them, and his work on the book "Regression: A Friendly Guide" to help programmers add regression techniques to their toolbox.
Every software project needs a tool for managing the repetitive tasks that are involved in building, running, and deploying the code. Frustrated with the limitations of tools like Make, Scons, and others Eduardo Schettino created doit to handle task automation in his own work and released it as open source. In this episode he shares the story behind the project, how it is implemented under the hood, and how you can start using it in your own projects to save you time and effort.
Whether we like it or not, advertising is a common and effective way to make money on the internet. In order to support the work being done at Read The Docs they decided to include advertisements on the documentation sites they were hosting, but they didn't want to alienate their users or collect unnecessary information. In this episode David Fischer explains how they built the Ethical Ads network to solve their problem, the technical and business challenges that are involved, and the open source application that they built to power their network.
Podcasts are one of the few mediums in the internet era that are still distributed through an open ecosystem. This has a number of benefits, but it also brings the challenge of making it difficult to find the content that you are looking for. Frustrated by the inability to pick and choose single episodes across various shows for his listening Wenbin Fang started the Listen Notes project to fulfill his own needs. He ended up turning that project into his full time business which has grown into the most full featured podcast search engine on the market. In this episode he explains how he build the Listen Notes application using Python and Django, his work to turn it into a sustainable business, and the various ways that you can build other applications and experiences on top of his API.
Outer space holds a deep fascination for people of all ages, and the key principle in its exploration both near and far is orbital mechanics. Poliastro is a pure Python package for exploring and simulating orbit calculations. In this episode Juan Luis Cano Rodriguez shares the story behind the project, how you can use it to learn more about space travel, and some of the interesting projects that have used it for planning planetary and interplanetary missions.
Deep learning frameworks encourage you to focus on the structure of your model ahead of the data that you are working with. Ludwig is a tool that uses a data oriented approach to building and training deep learning models so that you can experiment faster based on the information that you actually have, rather than spending all of our time manipulating features to make them match your inputs. In this episode Travis Addair explains how Ludwig is designed to improve the adoption of deep learning for more companies and a wider range of users. He also explains how the Horovod framework plugs in easily to allow for scaling your training workflow from your laptop out to a massive cluster of servers and GPUs. The combination of these tools allows for a declarative workflow that starts off easy but gives you full control over the end result.
A lot of time and energy goes into data analysis and machine learning projects to address various goals. Most of the effort is focused on the technical aspects and validating the results, but how much time do you spend on considering the experience of the people who are using the outputs of these projects? In this episode Benn Stancil explores the impact that our technical focus has on the perceived value of our work, and how taking the time to consider what the desired experience will be can lead us to approach our work more holistically and increase the satisfaction of everyone involved.
The true power of artificial intelligence is its ability to work collaboratively with humans. Nate Joens co-founded Structurely to create a conversational AI platform that augments human sales teams to help guide potential customers through the initial steps of the funnel. In this episode he discusses the technical and social considerations that need to be combined for a seamless conversational experience and how he and his team are tackling the problem.
Every machine learning model has to start with feature engineering. This is the process of combining input variables into a more meaningful signal for the problem that you are trying to solve. Many times this process can lead to duplicating code from previous projects, or introducing technical debt in the form of poorly maintained feature pipelines. In order to make the practice more manageable Soledad Galli created the feature-engine library. In this episode she explains how it has helped her and others build reusable transformations that can be applied in a composable manner with your scikit-learn projects. She also discusses the importance of understanding the data that you are working with and the domain in which your model will be used to ensure that you are selecting the right features.
The speed of Python is a subject of constant debate, but there is no denying that for compute heavy work it is not the optimal tool. Rather than rewriting your data oriented applications, or having to rearchitect them, the team at Bodo wrote a compiler that will do the optimization for you. In this episode Ehsan Totoni explains how they are able to translate pure Python into massively parallel processes that are optimized for high performance compute systems.
The world of finance has driven the development of many sophisticated techniques for data analysis. In this episode Paul Stafford shares his experiences working in the realm of risk management for financial exchanges. He discusses the types of risk that are involved, the statistical methods that he has found most useful for identifying strategies to mitigate that risk, and the software libraries that have helped him most in his work.
Machine learning and deep learning techniques are powerful tools for a large and growing number of applications. Unfortunately, it is difficult or impossible to understand the reasons for the answers that they give to the questions they are asked. In order to help shine some light on what information is being used to provide the outputs to your machine learning models Scott Lundberg created the SHAP project. In this episode he explains how it can be used to provide insight into which features are most impactful when generating an output, and how that insight can be applied to make more useful and informed design choices. This is a fascinating and important subject and this episode is an excellent exploration of how to start addressing the challenge of explainability.
Finding new and effective treatments for disease is a complex and time consuming endeavor, requiring a high degree of domain knowledge and specialized equipment. Combining his expertise in machine learning and graph algorithms with is interest in drug discovery Jian Tang created the TorchDrug project to help reduce the amount of time needed to find new candidate molecules for testing. In this episode he explains how the project is being used by machine learning researchers and biochemists to collaborate on finding effective treatments for real-world diseases.
The overwhelming growth of smartphones, smart speakers, and spoken word content has corresponded with increasingly sophisticated machine learning models for recognizing speech content in audio data. Dylan Fox founded Assembly to provide access to the most advanced automated speech recognition models for developers to incorporate into their own products. In this episode he gives an overview of the current state of the art for automated speech recognition, the varying requirements for accuracy and speed of models depending on the context in which they are used, and what is required to build a special purpose model for your own ASR applications.
Reinforcement learning is a branch of machine learning and AI that has a lot of promise for applications that need to evolve with changes to their inputs. To support the research happening in the field, including applications for robotics, Carlo D'Eramo and Davide Tateo created MushroomRL. In this episode they share how they have designed the project to be easy to work with, so that students can use it in their study, as well as extensible so that it can be used by businesses and industry professionals. They also discuss the strengths of reinforcement learning, how to design problems that can leverage its capabilities, and how to get started with MushroomRL for your own work.
A perennial problem of doing data science is that it works great on your laptop, until it doesn't. Another problem is being able to recreate your environment to collaborate on a problem with colleagues. Saturn Cloud aims to help with both of those problems by providing an easy to use platform for creating reproducible environments that you can use to build data science workflows and scale them easily with a managed Dask service. In this episode Julia Signall, head of open source at Saturn Cloud, explains how she is working with the product team and PyData community to reduce the points of friction that data scientists encounter as they are getting their work done.
You've got a machine learning model trained and running in production, but that's only half of the battle. Are you certain that it is still serving the predictions that you tested? Are the inputs within the range of tolerance that you designed? Monitoring machine learning products is an essential step of the story so that you know when it needs to be retrained against new data, or parameters need to be adjusted. In this episode Emeli Dral shares the work that she and her team at Evidently are doing to build an open source system for tracking and alerting on the health of your ML products in production. She discusses the ways that model drift can occur, the types of metrics that you need to track, and what to do when the health of your system is suffering. This is an important and complex aspect of the machine learning lifecycle, so give it a listen and then try out Evidently for your own projects.
Building a machine learning model is a process that requires a lot of iteration and trial and error. For certain classes of problem a large portion of the searching and tuning can be automated. This allows data scientists to focus their time on more complex or valuable projects, as well as opening the door for non-specialists to experiment with machine learning. Frustrated with some of the awkward or difficult to use tools for AutoML, Angela Lin and Jeremy Shih helped to create the EvalML framework. In this episode they share the use cases for automated machine learning, how they have designed the EvalML project to be approachable, and how you can use it for building and training your own models.
Data scientists are tasked with answering challenging questions using data that is often messy and incomplete. Anaconda is on a mission to make the lives of data professionals more manageable through creation and maintenance of high quality libraries and frameworks, the distribution of an easy to use Python distribution and package ecosystem, and high quality training material. In this episode Kevin Goldsmith, CTO of Anaconda, discusses the technical and social challenges faced by data scientists, the ways that the Python ecosystem has evolved to help address those difficulties, and how Anaconda is engaging with the community to provide high quality tools and education for this constantly changing practice.
Analysing networks is a growing area of research in academia and industry. In order to be able to answer questions about large or complex relationships it is necessary to have fast and efficient algorithms that can process the data quickly. In this episode Eugenio Angriman discusses his contributions to the NetworKit library to provide an accessible interface for these algorithms. He shares how he is using NetworKit for his own research, the challenges of working with large and complex networks, and the kinds of questions that can be answered with data that fits on your laptop.
Building a software-as-a-service (SaaS) business is a fairly well understood pattern at this point. When the core of the service is a set of machine learning products it introduces a whole new set of challenges. In this episode Dylan Fox shares his experience building Assembly AI as a reliable and affordable option for automatic speech recognition that caters to a developer audience. He discusses the machine learning development and deployment processes that his team relies on, the scalability and performance considerations that deep learning models introduce, and the user experience design that goes into building for a developer audience. This is a fascinating conversation about a unique cross-section of considerations and how Dylan and his team are building an impressive and useful service.
SQL has gone through many cycles of popularity and disfavor. Despite its longevity it is objectively challenging to work with in a collaborative and composable manner. In order to address these shortcomings and build a new interface for your database oriented workloads Erez Shinan created Preql. It is based on the same relational algebra that inspired SQL, but brings in more robust computer science principles to make it more manageable as you scale in complexity. In this episode he shares his motivation for creating the Preql project, how he has used Python to develop a new language for interacting with database engines, and the challenges of taking on the legacy of SQL as an individual.
When you start working on a data project there are always a variety of unknown factors that you have to explore. One of those is the volume of total data that you will eventually need to handle, and the speed and scale at which it will need to be processed. If you optimize for scale too early then it adds a high barrier to entry due to the complexities of distributed systems, but if you invest in a lot of engineering up front then it can be challenging to refactor for scale. Modin is a project that aims to remove that decision by letting you seamlessly replace your existing Pandas code and scale across CPU cores or across a cluster of machines. In this episode Devin Petersohn explains why he started working on solving this problem, how Modin is architected to allow for a smooth escalation from small to large volumes of data and compute, and how you can start using it today to accelerate your Pandas workflows.
With the rising availability of computation in everyday devices, there has been a corresponding increase in the appetite for voice as the primary interface. To accomodate this desire it is necessary for us to have high quality libraries for being able to process and generate audio data that can make sense of human speech. To facilitate research and industry applications for speech data Mirco Ravanelli and Peter Plantinga are building SpeechBrain. In this episode they explain how it works under the hood, the projects that they are using it for, and how you can get started with it today.
If you are interested in a library for working with graph structures that will also help you learn more about the research and theory behind the algorithms then look no further than graph-tool. In this episode Tiago Peixoto shares his work on graph algorithms and networked data and how he has built graph-tool to help in that research. He explains how it is implemented, how it evolved from a simple command line tool to a full-fledged library, and the benefits that he has found from building a personal project in the open.
Deep learning has largely taken over the research and applications of artificial intelligence, with some truly impressive results. The challenge that it presents is that for reasonable speed and performance it requires specialized hardware, generally in the form of a dedicated GPU (Graphics Processing Unit). This raises the cost of the infrastructure, adds deployment complexity, and drastically increases the energy requirements for training and serving of models. To address these challenges Nir Shavit combined his experiences in multi-core computing and brain science to co-found Neural Magic where he is leading the efforts to build a set of tools that prune dense neural networks to allow them to execute on commodity CPU hardware. In this episode he explains how sparsification of deep learning models works, the potential that it unlocks for making machine learning and specialized AI more accessible, and how you can start using it today.
Brett Cannon has been a long-time contributor to the Python language and community in many ways. In this episode he shares some of his work and thoughts on modernizing the ecosystem around the language. This includes standards for packaging, discovering the true core of the language, and how to make it possible to target mobile and web platforms.
The foundation of every ML model is the data that it is trained on. In many cases you will be working with tabular or unstructured information, but there is a growing trend toward networked, or graph data sets. Benedek Rozemberczki has focused his research and career around graph machine learning applications. In this episode he discusses the common sources of networked data, the challenges of working with graph data in machine learning projects, and describes the libraries that he has created to help him in his work. If you are dealing with connected data then this interview will provide a wealth of context and resources to improve your projects.
The growth of analytics has accelerated the use of SQL as a first class language. It has also grown the amount of collaboration involved in writing and maintaining SQL queries. With collaboration comes the inevitable variation in how queries are written, both structurally and stylistically which can lead to a significant amount of wasted time and energy during code review and employee onboarding. Alan Cruickshank was feeling the pain of this wasted effort first-hand which led him down the path of creating SQLFluff as a linter and formatter to enforce consistency and find bugs in the SQL code that he and his team were working with. In this episode he shares the story of how SQLFluff evolved from a simple hackathon project to an open source linter that is used across a range of companies and fosters a growing community of users and contributors. He explains how it has grown to support multiple dialects of SQL, as well as integrating with projects like DBT to handle templated queries. This is a great conversation about the long detours that are sometimes necessary to reach your original destination and the powerful impact that good tooling can have on team productivity.
Deep learning is gaining an immense amount of popularity due to the incredible results that it is able to offer with comparatively little effort. Because of this there are a number of engineers who are trying their hand at building machine learning models with the wealth of frameworks that are available. Andrew Ferlitsch wrote a book to capture the useful patterns and best practices for building models with deep learning to make it more approachable for newcomers ot the field. In this episode he shares his deep expertise and extensive experience in building and teaching machine learning across many companies and industries. This is an entertaining and educational conversation about how to build maintainable models across a variety of applications.
Unit tests are an important tool to ensure the proper functioning of your application, but writing them can be a chore. Stephan Lukasczyk wants to reduce the monotony of the process for Python developers. As part of his PhD research he created the Pynguin project to automate the creation of unit tests. In this episode he explains the complexity involved in generating useful tests for a dynamic language, how he has designed Pynguin to address the challenges, and how you can start using it today for your own work.
Natural language processing is a powerful tool for extracting insights from large volumes of text. With the growth of the internet and social platforms, and the increasing number of people and communities conducting their professional and personal activities online, the opportunities for NLP to create amazing insights and experiences are endless. In order to work with such a large and growing corpus it has become necessary to move beyond purely statistical methods and embrace the capabilities of deep learning, and transfer learning in particular. In this episode Paul Azunre shares his journey into the application and implementation of transfer learning for natural language processing. This is a fascinating look at the possibilities of emerging machine learning techniques for transforming the ways that we interact with technology.
Machine learning is a tool that has typically been performed on large volumes of data in one place. As more computing happens at the edge on mobile and low power devices, the learning is being federated which brings a new set of challenges. Daniel Beutel co-created the Flower framework to make federated learning more manageable. In this episode he shares his motivations for starting the project, how you can use it for your own work, and the unique challenges and benefits that this emerging model offers. This is a great exploration of the federated learning space and a framework that makes it more approachable.
Data exploration is an important step in any analysis or machine learning project. Visualizing the data that you are working with makes that exploration faster and more effective, but having to remember and write all of the code to build a scatter plot or histogram is tedious and time consuming. In order to eliminate that friction Doris Lee helped create the Lux project, which wraps your Pandas data frame and automatically generates a set of visualizations without you having to lift a finger. In this episode she explains how Lux works under the hood, what inspired her to create it in the first place, and how it can help you create a better end result. The Lux project is a valuable addition to the toolbox of anyone who is doing data wrangling with Pandas.
Any project that is used by more than one person will eventually need to handle permissions for each of those users. It is certainly possible to write that logic yourself, but you'll almost certainly do it wrong at least once. Rather than waste your time fighting with bugs in your authorization code it makes sense to use a well-maintained library that has already made and fixed all of the mistakes so that you don't have to. In this episode Sam Scott shares the Oso framework to give you a clean separation between your authorization policies and your application code. He explains how you can call a simple function to ask if something is allowed, and then manage the complex rules that match your particular needs as a separate concern. He describes the motivation for building a domain specific language based on logic programming for policy definitions, how it integrates with the host language (such as Python), and how you can start using it in your own applications today. This is a must listen even if you never use the project because it is a great exploration of all of the incidental complexity that is involved in permissions management.
Being able to present your ideas is one of the most valuable and powerful skills to have as a professional, regardless of your industry. For software engineers it is especially important to be able to communicate clearly and effectively because of the detail-oriented nature of the work. Unfortunately, many people who work in software are more comfortable in front of the keyboard than a crowd. In this episode Neil Thompson shares his story of being an accidental public speaker and how he is helping other engineers start down the road of being effective presenters. He discusses the benefits for your career, how to build the skills, and how to find opportunities to practice them. Even if you never want to speak at a conference, it's still worth your while to listen to Neil's advice and find ways to level up your presentation and speaking skills.
One of the great promises of computers is that they will make our work faster and easier, so why do we all spend so much time manually copying data from websites, or entering information into web forms, or any of the other tedious tasks that take up our time? As developers our first inclination is to "just write a script" to automate things, but how do you share that with your non-technical co-workers? In this episode Antti Karjalainen, CEO and co-founder of Robocorp, explains how Robotic Process Automation (RPA) can help us all cut down on time-wasting tasks and let the computers do what they're supposed to. He shares how he got involved in the RPA industry, his work with Robot Framework and RPA framework, how to build and distribute bots, and how to decide if a task is worth automating. If you're sick of spending your time on mind-numbing copy and paste then give this episode a listen and then let the robots do the work for you.
When you are writing code it is all to easy to introduce subtle bugs or leave behind unused code. Unused variables, unused imports, overly complex logic, etc. If you are careful and diligent you can find these problems yourself, but isn't that what computers are supposed to help you with? Thankfully Python has a wealth of tools that will work with you to keep your code clean and maintainable. In this episode Anthony Sottile explores Flake8, one of the most popular options for identifying those problematic lines of code. He shares how he became involved in the project and took over as maintainer and explains the different categories of code quality tooling and how Flake8 compares to other static analyzers. He also discusses the ecosystem of plugins that have grown up around it, including some detailed examples of how you can write your own (and why you might want to).
Writing code that is easy to read and understand will have a lasting impact on you and your teammates over the life of a project. Sometimes it can be difficult to identify opportunities for simplifying a block of code, especially if you are early in your journey as a developer. If you work with senior engineers they can help by pointing out ways to refactor your code to be more readable, but they aren't always available. Brendan Maginnis and Nick Thapen created Sourcery to act as a full time pair programmer sitting in your editor of choice, offering suggestions and automatically refactoring your Python code. In this episode they share their journey of building a tool to automatically find opportunities for refactoring in your code, including how it works under the hood, the types of refactoring that it supports currently, and how you can start using it in your own work today. It always pays to keep your tool box organized and your tools sharp and Sourcery is definitely worth adding to your repertoire.
Becoming data driven is the stated goal of a large and growing number of organizations. In order to achieve that mission they need a reliable and scalable method of accessing and analyzing the data that they have. While business intelligence solutions have been around for ages, they don't all work well with the systems that we rely on today and a majority of them are not open source. Superset is a Python powered platform for exploring your data and building rich interactive dashboards that gets the information that your organization needs in front of the people that need it. In this episode Maxime Beauchemin, the creator of Superset, shares how the project got started and why it has become such a widely used and popular option for exploring and sharing data at companies of all sizes. He also explains how it functions, how you can customize it to fit your specific needs, and how to get it up and running in your own environment.
Python is a language that is used in almost every imaginable context and by people from an amazing range of backgrounds. A lot of the people who use it wouldn't even call themselves programmers, because that is not the primary focus of their job. In this episode Chris Moffitt shares his experience writing Python as a business user. In order to share his insights and help others who have run up against the limits of Excel he maintains the site Practical Business Python where he publishes articles that help introduce newcomers to Python and explain how to perform tasks such as building reports, automating Excel files, and doing data analysis. This is a great conversation that illustrates how useful it is to learn Python even if you never intend to write software professionally.
There are a large and growing number of businesses built by and for data science and machine learning teams that rely on Python. Tony Liu is a venture investor who is following that market closely and betting on its continued success. In this episode he shares his own journey into the role of an investor and discusses what he is most excited about in the industry. He also explains what he looks at when investing in a business and gives advice on what potential founders and early employees of startups should be thinking about when starting on that journey.
Jupyter notebooks are a dominant tool for data scientists, but they lack a number of conveniences for building reusable and maintainable systems. For machine learning projects in particular there is a need for being able to pivot from exploring a particular dataset or problem to integrating that solution into a larger workflow. Rick Lamers and Yannick Perrenet were tired of struggling with one-off solutions when they created the Orchest platform. In this episode they explain how Orchest allows you to turn your notebooks into executable components that are integrated into a graph of execution for running end-to-end machine learning workflows.
When you are writing a script it can become unwieldy to understand how the logic and data are flowing through the program. To make this easier to follow you can use a flow-based approach to building your programs. Leonn Thomm created the Ryven project as an environment for visually constructing a flow-based program. In this episode he shares his inspiration for creating the Ryven project, how it changes the way you think about program design, how Ryven is implemented, and how to get started with it for your own programs.
One of the perennial challenges in software engineering is to reduce the opportunity for bugs to creep into the system. Some of the tools in our arsenal that help in this endeavor include rich type systems, static analysis, writing tests, well defined interfaces, and linting. Phillip Schanely created the CrossHair project in order to add another ally in the fight against broken code. It sits somewhere between type systems, automated test generation, and static analysis. In this episode he explains his motivation for creating it, how he uses it for his own projects, and how to start incorporating it into yours. He also discusses the utility of writing contracts for your functions, and the differences between property based testing and SMT solvers. This is an interesting and informative conversation about some of the more nuanced aspects of how to write well-behaved programs.
Collaborating on software projects is largely a solved problem, with a variety of hosted or self-managed platforms to choose from. For data science projects, collaboration is still an open question. There are a number of projects that aim to bring collaboration to data science, but they are all solving a different aspect of the problem. Dean Pleban and Guy Smoilovsky created DagsHub to give individuals and teams a place to store and version their code, data, and models. In this episode they explain how DagsHub is designed to make it easier to create and track machine learning experiments, and serve as a way to promote collaboration on open source data science projects.
Creating well designed software is largely a problem of context and understanding. The majority of programming environments rely on documentation, tests, and code being logically separated despite being contextually linked. In order to weave all of these concerns together there have been many efforts to create a literate programming environment. In this episode Jeremy Howard of fast.ai fame and Hamel Husain of GitHub share the work they have done on nbdev. The explain how it allows you to weave together documentation, code, and tests in the same context so that it is more natural to explore and build understanding when working on a project. It is built on top of the Jupyter environment, allowing you to take advantage of the other great elements of that ecosystem, and it provides a number of excellent out of the box features to reduce the friction in adopting good project hygiene, including continuous integration and well designed documentation sites. Regardless of whether you have been programming for 5 days, 5 years, or 5 decades you should take a look at nbdev to experience a different way of looking at your code.
Working with network protocols is a common need for software projects, particularly in the current age of the internet. As a result, there are a multitude of libraries that provide interfaces to the various protocols. The problem is that implementing a network protocol properly and handling all of the edge cases is hard, and most of the available libraries are bound to a particular I/O paradigm which prevents them from being widely reused. To address this shortcoming there has been a movement towards "sans I/O" implementations that provide the business logic for a given protocol while remaining agnostic to whether you are using async I/O, Twisted, threads, etc. In this episode Aymeric Augustin shares his experience of refactoring his popular websockets library to be I/O agnostic, including the challenges involved in how to design the interfaces, the benefits it provides in simplifying the tests, and the work needed to add back support for async I/O and other runtimes. This is a great conversation about what is involved in making an ideal a reality.
One of the common complaints about Python is that it is slow. There are languages and runtimes that can execute code faster, but they are not as easy to be productive with, so many people are willing to make that tradeoff. There are some use cases, however, that truly need the benefit of faster execution. To address this problem Kevin Modzelewski helped to create the Pyston intepreter that is focused on speeding up unmodified Python code. In this episode he shares the history of the project, discusses his current efforts to optimize a fork of the CPython interpreter, and his goals for building a business to support the ongoing work to make Python faster for everyone. This is an interesting look at the opportunities that exist in the Python ecosystem and the work being done to address some of them.
Every software project has a certain amount of boilerplate to handle things like linting rules, test configuration, and packaging. Rather than recreate everything manually every time you start a new project you can use a utility to generate all of the necessary scaffolding from a template. This allows you to extract best practices and team standards into a reusable project that will save you time. The Copier project is one such utility that goes above and beyond the bare minimum by supporting project _evolution_, letting you bring in the changes to the source template after you already have a project that you have dedicated significant work on. In this episode Jairo Llopis explains how the Copier project works under the hood and the advanced capabilities that it provides, including managing the full lifecycle of a project, composing together multiple project templates, and how you can start using it for your own work today.
On its surface Python is a simple language which is what has contributed to its rise in popularity. As you move to intermediate and advanced usage you will find a number of interesting and elegant design elements that will let you build scalable and maintainable systems and design friendly interfaces. Luciano Ramalho is best known as the author of Fluent Python which has quickly become a leading resource for Python developers to increase their facility with the language. In this episode he shares his journey with Python and his perspective on how the recent changes to the interpreter and ecosystem are influencing who is adopting it and how it is being used. Luciano has an interesting perspective on how the feedback loop between the community and the language is driving the curent and future priorities of the features that are added.
Building a web application requires integrating a number of separate concerns into a single experience. One of the common requirements is a content management system to allow product owners and marketers to make the changes needed for them to do their jobs. Rather than spend the time and focus of your developers to build the end to end system a growing trend is to use a headless CMS. In this episode Jake Lumetta shares why he decided to spend his time and energy on building a headless CMS as a service, when and why you might want to use one, and how to integrate it into your applications so that you can focus on the rest of your application.
Notebooks have been a useful tool for analytics, exploratory programming, and shareable data science for years, and their popularity is continuing to grow. Despite their widespread use, there are still a number of challenges that inhibit collaboration and use by non-technical stakeholders. Barry McCardel and his team at Hex have built a platform to make collaboration on Jupyter notebooks a first class experience, as well as allowing notebooks to be parameterized and exposing the logic through interactive web applications. In this episode Barry shares his perspective on the state of the notebook ecosystem, why it is such as powerful tool for computing and analytics, and how he has built a successful business around improving the end to end experience of working with notebooks. This was a great conversation about an important piece of the toolkit for every analyst and data scientist.
When working with data it's important to understand when it is correct. If there is a time dimension, then it can be difficult to know when variation is normal. Anomaly detection is a useful tool to address these challenges, but a difficult one to do well. In this episode Smit Shah and Sayan Chakraborty share the work they have done on Luminaire to make anomaly detection easier to work with. They explain the complexities inherent to working with time series data, the strategies that they have incorporated into Luminaire, and how they are using it in their data pipelines to identify errors early. If you are working with any kind of time series then it's worth giving Luminaure a look.
Technologies for building data pipelines have been around for decades, with many mature options for a variety of workloads. However, most of those tools are focused on processing of text based data, both structured and unstructured. For projects that need to manage large numbers of binary and audio files the list of options is much shorter. In this episode Lynn Root shares the work that she and her team at Spotify have done on the Klio project to make that list a bit longer. She discusses the problems that are specific to working with binary data, how the Klio project is architected to allow for scalable and efficient processing of massive numbers of audio files, why it was released as open source, and how you can start using it today for your own projects. If you are struggling with ad-hoc infrastructure and a medley of tools that have been cobbled together for analyzing large or numerous binary assets then this is definitely a tool worth testing out.
Building a complete web application requires expertise in a wide range of disciplines. As a result it is often the work of a whole team of engineers to get a new project from idea to production. Meredydd Luff and his co-founder built the Anvil platform to make it possible to build full stack applications entirely in Python. In this episode he explains why they released the application server as open source, how you can use it to run your own projects for free, and why developer tooling is the sweet spot for an open source business model. He also shares his vision for how the end-to-end experience of building for the web should look, and some of the innovative projects and companies that were made possible by the reduced friction that the Anvil platform provides. Give it a listen today to gain some perspective on what it could be like to build a web app.
In a software project writing code is just one step of the overall lifecycle. There are many repetitive steps such as linting, running tests, and packaging that need to be run for each project that you maintain. In order to reduce the overhead of these repeat tasks, and to simplify the process of integrating code across multiple systems the use of monorepos has been growing in popularity. The Pants build tool is purpose built for addressing all of the drudgery and for working with monorepos of all sizes. In this episode core maintainers Eric Arellano and Stu Hood explain how the Pants project works, the benefits of automatic dependency inference, and how you can start using it in your own projects today. They also share useful tips for how to organize your projects, and how the plugin oriented architecture adds flexibility for you to customize Pants to your specific needs.
Building a machine learning model is a process that requires well curated and cleaned data and a lot of experimentation. Doing it repeatably and at scale with a team requires a way to share your discoveries with your teammates. This has led to a new set of operational ML platforms. In this episode Michael Del Balso shares the lessons that he learned from building the platform at Uber for putting machine learning into production. He also explains how the feature store is becoming the core abstraction for data teams to collaborate on building machine learning models. If you are struggling to get your models into production, or scale your data science throughput, then this interview is worth a listen.
The CPython implementation has grown and evolved significantly over the past ~25 years. In that time there have been many other projects to create compatible runtimes for your Python code. One of the challenges for these other projects is the lack of a fully documented specification of how and why everything works the way that it does. In the most recent Python language summit Mark Shannon proposed implementing a formal specification for CPython, and in this episode he shares his reasoning for why that would be helpful and what is involved in making it a reality.
Artificial intelligence applications can provide dramatic benefits to a business, but only if you can bring them from idea to production. Henrik Landgren was behind the original efforts at Spotify to leverage data for new product features, and in his current role he works on an AI system to evaluate new businesses to invest in. In this episode he shares advice on how to identify opportunities for leveraging AI to improve your business, the capabilities necessary to enable aa successful project, and some of the pitfalls to watch out for. If you are curious about how to get started with AI, or what to consider as you build a project, then this is definitely worth a listen.
Python and Java are two of the most popular programming languages in the world, and have both been around for over 20 years. In that time there have been numerous attempts to provide interoperability between them, with varying methods and levels of success. One such project is JPype, which allows you to use Java classes in your Python code. In this episode the current lead developer, Karl Nelson, explains why he chose it as his preferred tool for combining these ecosystems, how he and his team are using it, and when and how you might want to use it for your own projects. He also discusses the work he has done to enable use of JPype on Android, and what is in store for the future of the project. If you have ever wanted to use a library or module from Java, but the rest of your project is already in Python, then this episode is definitely worth a listen.
The release of Python 3.9 introduced a new parser that paves the way for brand new features. Every programming language has its own specific syntax for representing the logic that you are trying to express. The way that the rules of the language are defined and validated is with a grammar definition, which in turn is processed by a parser. The parser that the Python language has relied on for the past 25 years has begun to show its age through mounting technical debt and a lack of flexibility in defining new syntax. In this episode Pablo Galindo and Lysandros Nikolaou explain how, together with Python's creator Guido van Rossum, they replaced the original parser implementation with one that is more flexible and maintainable, why now was the time to make the change, and how it will influence the future evolution of the language.
The way that applications are being built and delivered has changed dramatically in recent years with the growing trend toward cloud native software. As part of this movement toward the infrastructure and orchestration that powers your project being defined in software, a new approach to operations is gaining prominence. Commonly called GitOps, the main principle is that all of your automation code lives in version control and is executed automatically as changes are merged. In this episode Victor Farcic shares details on how that workflow brings together developers and operations engineers, the challenges that it poses, and how it influences the architecture of your software. This was an interesting look at an emerging pattern in the development and release cycle of modern applications.
Learning to code is a neverending journey, which is why it's important to find a way to stay motivated. A common refrain is to just find a project that you're interested in building and use that goal to keep you on track. The problem with that advice is that as a new programmer, you don't have the knowledge required to know which projects are reasonable, which are difficult, and which are effectively impossible. Steven Lott has been sharing his programming expertise as a consultant, author, and trainer for years. In this episode he shares his insights on how to help readers, students, and colleagues interested enough to learn the fundamentals without losing sight of the long term gains. He also uses his own difficulties in learning to maintain, repair, and captain his sailboat as relatable examples of the learning process and how the lessons he has learned can be translated to the process of learning a new technology or skill. This was a great conversation about the various aspects of how to learn, how to stay motivated, and how to help newcomers bridge the gap between what they want to create and what is within their grasp.
Python is a powerful and expressive programming language with a vast ecosystem of incredible applications. Unfortunately, it has always been challenging to share those applications with non-technical end users. Gregory Szorc set out to solve the problem of how to put your code on someone else's computer and have it run without having to rely on extra systems such as virtualenvs or Docker. In this episode he shares his work on PyOxidizer and how it allows you to build a self-contained Python runtime along with statically linked dependencies and the software that you want to run. He also digs into some of the edge cases in the Python language and its ecosystem that make this a challenging problem to solve, and some of the lessons that he has learned in the process. PyOxidizer is an exciting step forward in the evolution of packaging and distribution for the Python language and community.
Servers and services that have any exposure to the public internet are under a constant barrage of attacks. Network security engineers are tasked with discovering and addressing any potential breaches to their systems, which is a never-ending task as attackers continually evolve their tactics. In order to gain better visibility into complex exploits Colin O'Brien built the Grapl platform, using graph database technology to more easily discover relationships between activities within and across servers. In this episode he shares his motivations for creating a new system to discover potential security breaches, how its design simplifies the work of identifying complex attacks without relying on brittle rules, and how you can start using it to monitor your own systems today.
News media is an important source of information for understanding the context of the world. To make it easier to access and process the contents of news sites Lucas Ou-Yang built the Newspaper library that aids in automatic retrieval of articles and prepare it for analysis. In this episode he shares how the project got started, how it is implemented, and how you can get started with it today. He also discusses how recent improvements in the utility and ease of use of deep learning libraries open new possibilities for future iterations of the project.
Data applications are complex and continually evolving, often requiring collaboration across multiple teams. In order to keep everyone on the same page a high level abstraction is needed to facilitate a cross-cutting view of the data orchestration across integration, transformation, analytics, and machine learning. Dagster is an innovative new framework that leans on the power and flexibility of Python to provide an extensible interface to the complete lifecycle of data projects. In this episode Nick Schrock explains how he designed the Dagster project to allow for integration with the entire data ecosystem while providing an opinionated structure for connecting the different stages of computation. He also discusses how he is working to grow an open ecosystem around the Dagster project, and his thoughts on building a sustainable business on top of it without compromising the integrity of the community. This was a great conversation about playing the long game when building a business while providing a valuable utility to a complex problem domain.
The internet is a rich source of information, but a majority of it isn't accessible programmatically through APIs or databases. To address that shortcoming there are a variety of web scraping frameworks that aid in extracting structured data from web pages. In this episode Attila Tóth shares the challenges of web data extraction, the ways that you can use it, and how Scrapy and ScrapingHub can help you with your projects.
A large portion of the software industry has standardized on Git as the version control sytem of choice. But have you thought about all of the information that you are generating with your branches, commits, and code changes? Davide Spadini created the PyDriller framework to simplify the work of mining software repositories to perform research on the technical and social aspects of software engineering. In this episode he shares some of the insights that you can gain by exploring the history of your code, the complexities of building a framework to interact with Git, and some of the interesting ways that PyDriller can be used to inform your own development practices.
The Musicbrainz project was an early entry in the movement to build an open data ecosystem. In recent years, the Metabrainz Foundation has fostered a growing ecosystem of projects to support the contribution of, and access to, metadata, listening habits, and review of music. The majority of those projects are written in Python, and in this episode Param Singh explains how they are built, how they fit together, and how they support the goals of the Metabrains Foundation. This was an interesting exporation of the work involved in building an ecosystem of open data, the challenges of making it sustainable, and the benefits of building for the long term rather than trying to achieve a quick win.
Python is a leading choice for data science due to the immense number of libraries and frameworks readily available to support it, but it is still difficult to scale. Dask is a framework designed to transparently run your data analysis across multiple CPU cores and multiple servers. Using Dask lifts a limitation for scaling your analytical workloads, but brings with it the complexity of server administration, deployment, and security. In this episode Matthew Rocklin and Hugo Bowne-Anderson discuss their recently formed company Coiled and how they are working to make use and maintenance of Dask in production. The share the goals for the business, their approach to building a profitable company based on open source, and the difficulties they face while growing a new team during a global pandemic.
Netflix uses machine learning to power every aspect of their business. To do this effectively they have had to build extensive expertise and tooling to support their engineers. In this episode Savin Goyal discusses the work that he and his team are doing on the open source machine learning operations platform Metaflow. He shares the inspiration for building an opinionated framework for the full lifecycle of machine learning projects, how it is implemented, and how they have designed it to be extensible to allow for easy adoption by users inside and outside of Netflix. This was a great conversation about the challenges of building machine learning projects and the work being done to make it more achievable.
One of the best methods for learning programming is to just build a project and see how things work first-hand. With that in mind, Ken Youens-Clark wrote a whole book of Tiny Python Projects that you can use to get started on your journey. In this episode he shares his inspiration for the book, his thoughts on the benefits of teaching testing principles and the use of linting and formatting tools, as well as the benefits of trying variations on a working program to see how it behaves. This was a great conversation about useful strategies for supporting new programmers in their efforts to learn a valuable skill.
Python is an intuitive and flexible language, but that versatility can also lead to problematic designs if you're not careful. Nikita Sobolev is the CTO of Wemake Services where he works on open source projects that encourage clean coding practices and maintainable architectures. In this episode he discusses his work on the DRY Python set of libraries and how they provide an accessible interface to functional programming patterns while maintaining an idiomatic Python interface. He also shares the story behind the wemake Python styleguide plugin for Flake8 and the benefits of strict linting rules to engender good development habits. This was a great conversation about useful practices to build software that will be easy and fun to work on.
Barry Warsaw has been a member of the Python community since the very beginning. His contributions to the growth of the language and its ecosystem are innumerable and diverse, earning him the title of Friendly Language Uncle For Life. In this episode he reminisces on his experiences as a core developer, a member of the Python Steering Committee, and his roles at Canonical and LinkedIn supporting the use of Python at those companies. In order to know where you are going it is always important to understand where you have been and this was a great conversation to get a sense of the history of how Python has gotten to where it is today.
Building and managing servers is a challenging task. Configuration management tools provide a framework for handling the various tasks involved, but many of them require learning a specific syntax and toolchain. PyInfra is a configuration management framework that embraces the familiarity of Pure Python, allowing you to build your own integrations easily and package it all up using the same tools that you rely on for your applications. In this episode Nick Barrett explains why he built it, how it is implemented, and the ways that you can start using it today. He also shares his vision for the future of the project and you can get involved. If you are tired of writing mountains of YAML to set up your servers then give PyInfra a try today.
Programming languages are a powerful tool and can be used to create all manner of applications, however sometimes their syntax is more cumbersome than necessary. For some industries or subject areas there is already an agreed upon set of concepts that can be used to express your logic. For those cases you can create a Domain Specific Language, or DSL to make it easier to write programs that can express the necessary logic with a custom syntax. In this episode Igor Dejanović shares his work on textX and how you can use it to build your own DSLs with Python. He explains his motivations for creating it, how it compares to other tools in the Python ecosystem for building parsers, and how you can use it to build your own custom languages.
Once you release an application into production it can be difficult to understand all of the ways that it is interacting with the systems that it integrates with. The OpenTracing project and its accompanying ecosystem of technologies aims to make observability of your systems more accessible. In this episode Austin Parker and Alex Boten explain how the correlation of tracing and metrics collection improves visibility of how your software is behaving, how you can use the Python SDK to automatically instrument your applications, and their vision for the future of observability as the OpenTelemetry standard gains broader adoption.
Our thought patterns are rarely linear or hierarchical, instead following threads of related topics in unpredictable directions. Topic modeling is an approach to knowledge management which allows for forming a graph of associations to make capturing and organizing your thoughts more natural. In this episode Brett Kromkamp shares his work on the Contextualize project and how you can use it for building your own topic models. He explains why he wrote a new topic modeling engine, how it is architected, and how it compares to other systems for organizing information. Once you are done listening you can take Contextualize for a test run for free with his hosted instance.
You spend a lot of time and energy on building a great application, but do you know how it's actually being used? Using a product analytics tool lets you gain visibility into what your users find helpful so that you can prioritize feature development and optimize customer experience. In this episode PostHog CTO Tim Glaser shares his experience building an open source product analytics platform to make it easier and more accessible to understand your product. He shares the story of how and why PostHog was created, how to incorporate it into your projects, the benefits of providing it as open source, and how it is implemented. If you are tired of fighting with your user analytics tools, or unwilling to entrust your data to a third party, then have a listen and then test out PostHog for yourself.
The divide between Python 2 and 3 lasted a long time, and in recent years all of the new features were added to version 3. To help bridge the gap and extend the viability of version 2 Naftali Harris created Tauthon, a fork of Python 2 that backports features from Python 3. In this episode he explains his motivation for creating it, the process of maintaining it and backporting features, and the ways that it is being used by developers who are unable to make the leap. This was an interesting look at how things might have been if the elusive Python 2.8 had been created as a more gentle transition.
Dependency management in Python has taken a long and winding path, which has led to the current dominance of Pip. One of the remaining shortcomings is the lack of a robust mechanism for resolving the package and version constraints that are necessary to produce a working system. Thankfully, the Python Software Foundation has funded an effort to upgrade the dependency resolution algorithm and user experience of Pip. In this episode the engineers working on these improvements, Pradyun Gedam, Tzu-Ping Chung, and Paul Moore, discuss the history of Pip, the challenges of dependency management in Python, and the benefits that surrounding projects will gain from a more robust resolution algorithm. This is an exciting development for the Python ecosystem, so listen now and then provide feedback on how the new resolver is working for you.
Summary One of the most common causes of bugs is incorrect data being passed throughout your program. Pydantic is a library that provides runtime checking and validation of the information that you rely on in your code. In this episode Samuel Colvin explains why he created it, the interesting and useful ways that it can be used, and how to integrate it into your own projects. If you are tired of unhelpful errors due to bad data then listen now and try it out today.
Announcements * Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great. * When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, node balancers, a 40 Gbit/s public network, fast object storage, and a brand new managed Kubernetes platform, all controlled by a convenient API you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they’ve got dedicated CPU and GPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show! * You listen to this show because you love Python and want to keep your skills up to date. Machine learning is finding its way into every aspect of software engineering. Springboard has partnered with us to help you take the next step in your career by offering a scholarship to their Machine Learning Engineering career track program. In this online, project-based course every student is paired with a Machine Learning expert who provides unlimited 1:1 mentorship support throughout the program via video conferences. You’ll build up your portfolio of machine learning projects and gain hands-on experience in writing machine learning algorithms, deploying models into production, and managing the lifecycle of a deep learning prototype. Springboard offers a job guarantee, meaning that you don’t have to pay for the program until you get a job in the space. Podcast.__init__ is exclusively offering listeners 20 scholarships of $500 to eligible applicants. It only takes 10 minutes and there’s no obligation. Go to pythonpodcast.com/springboard and apply today! Make sure to use the code AISPRINGBOARD when you enroll. * Your host as usual is Tobias Macey and today I’m interviewing Samuel Colvin about Pydantic, a library for enforcing type hints at runtime
Interview * Introductions * How did you get introduced to Python? * Can you start by describing what Pydantic is and what motivated you to create it? * What are the main use cases that benefit from Pydantic? * There are a number of libraries in the Python ecosystem to handle various conventions or "best practices" for settings management. How does pydantic fit in that category and why might someone choose to use it over the other options? * There are also a number of libraries for defining data schemas or validation such as Marshmallow and Cerberus. How does Pydantic compare to the available options for those cases? + What are some of the challenges, whether technical or conceptual, that you face in building a library to address both of these areas? * The 3.7 release of Python added built in support for dataclasses as a means of building containers for data with type validation. What are the tradeoffs of pydantic vs the built in dataclass functionality? * How much overhead does pydantic add for doing runtime validation of the modelled data? * In the documentation there is a nuanced point that you make about parsing vs validation and your choices as to what to support in pydantic. Why is that a necessary distinction to make? + What are the limitations in terms of usage that you are accepting by choosing to allow for implicit conversion or potentially silent loss of precision in the parsed data? + What are the benefits of punting on the strict validation of data out of the box? * What has been your design philosophy for constructing the user facing API? * How is Pydantic implemented and how has the overall architecture evolved since you first began working on it? + What have you found to be the most challenging aspects of building a library for managing the consistency of data structures in a dynamic language? - What are some of the strengths and weaknesses of Python’s type system? * What is the workflow for a developer who is using Pydantic in their code? + What are some of the pitfalls or edge cases that they might run into? * What is involved in integrating with other libraries/frameworks such as Django for web development or Dagster for building data pipelines? * What are some of the more advanced capabilities or use cases of Pydantic that are less obvious? * What are some of the features or capabilities of Pydantic that are often overlooked which you think should be used more frequently? * What are some of the most interesting, innovative, or unexpected ways that you have seen Pydantic used? * What are some of the most interesting, challenging, or unexpected lessons that you have learned through your work on or with Pydantic? * When is Pydantic the wrong choice? * What do you have planned for the future of the project?
Keep In Touch * samuelcolvin on GitHub * Website * LinkedIn * @samuel_colvin on Twitter
Picks * Tobias + Devil Sticks * Samuel + Flash Boys by Michael Lewis + Algorithms To Live By by Brian Christian and Tom Griffiths + NGrok.com
Closing Announcements * Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management. * Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. * If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com) with your story. * To help other people find the show please leave a review on iTunes and tell your friends and co-workers * Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
Links * Pydantic * Matlab * C# * FastAPI + Podcast Episode * Marshmallow + Podcast Episode * Cerberus * 12 Factor App * Django * Python Type Hints * Cython + Podcast Episode * MyPy + Podcast Episode * Duck Typing * Haskell * Higher Order Types * PyCharm Pydantic Plugin * Django Rest Framework * Avro * Parquet * Dagster + Data Engineering Podcast Episode * Starlette * Flask * Ludwig * Deep Pavlov * Fast MRI * Reagent * Pynt * Open Source Has Failed article
The intro and outro music is from Requiem for a Fish The Freak Fandango Orchestra / CC BY-SA
Summary More of us are working remotely than ever before, many with no prior experience with a remote work environment. In this episode Quinn Slack discusses his thoughts and experience of running Sourcegraph as a fully distributed company. He covers the lessons that he has learned in moving from partially to fully remote, the practices that have worked well in managing a distributed workforce, and the challenges that he has faced in the process. If you are struggling with your remote work situation then this conversation has some useful tips and references for further reading to help you be successful in the current environment.
Announcements * Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great. * When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, node balancers, a 40 Gbit/s public network, fast object storage, and a brand new managed Kubernetes platform, all controlled by a convenient API you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they’ve got dedicated CPU and GPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show! * You monitor your website to make sure that you’re the first to know when something goes wrong, but what about your data? Tidy Data is the DataOps monitoring platform that you’ve been missing. With real time alerts for problems in your databases, ETL pipelines, or data warehouse, and integrations with Slack, Pagerduty, and custom webhooks you can fix the errors before they become a problem. Go to pythonpodcast.com/tidydata today and get started for free with no credit card required. * Your host as usual is Tobias Macey and today I’m interviewing Quinn Slack about his experience managing a fully remote company and useful tips for remote work
Interview * Introductions * How did you get introduced to Python? * Can you start by giving an overview of the team structure at Sourcegraph? * You recently moved to being fully remote. What was the motivating factor and how has it changed your personal workflow? + What is your prior history with working remote? * team practices for visibility of progress * impact of remote teams on how code is written and organized + reducing review burden by writing clearer code * structuring meetings when remote * points of friction for remote developer teams * benefits of being fully remote * incentivizing documentation * compensation structure
Keep In Touch * LinkedIn * @sqs on Twitter * sqs on GitHub
Picks * Tobias + Joplin App * Quinn + Skunkworks by Ben Rich
Closing Announcements * Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management. * Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. * If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com) with your story. * To help other people find the show please leave a review on iTunes and tell your friends and co-workers * Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
Links * Sourcegraph * Quinn’s Python Search Engine * Sourcegraph Employee Handbook * Gitlab * Gitlab Handbook * Zapier * Zapier Guide To Remote Work * Automattic * Automattic Blog On Distributed Work * Comments Showing Intent
The intro and outro music is from Requiem for a Fish The Freak Fandango Orchestra / CC BY-SA
Summary After you write your application, you need a way to make it available to your users. These days, that usually means deploying it to a cloud provider, whether that’s a virtual server, a serverless platform, or a Kubernetes cluster. To manage the increasingly dynamic and flexible options for running software in production, we have turned to building infrastructure as code. Pulumi is an open source framework that lets you use your favorite language to build scalable and maintainable systems out of cloud infrastructure. In this episode Luke Hoban, CTO of Pulumi, explains how it differs from other frameworks for interacting with infrastructure platforms, the benefits of using a full programming language for treating infrastructure as code, and how you can get started with it today. If you are getting frustrated with switching contexts when working between the application you are building and the systems that it runs on, then listen now and then give Pulumi a try.
Announcements * Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great. * When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, node balancers, a 40 Gbit/s public network, fast object storage, and a brand new managed Kubernetes platform, all controlled by a convenient API you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they’ve got dedicated CPU and GPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show! * You monitor your website to make sure that you’re the first to know when something goes wrong, but what about your data? Tidy Data is the DataOps monitoring platform that you’ve been missing. With real time alerts for problems in your databases, ETL pipelines, or data warehouse, and integrations with Slack, Pagerduty, and custom webhooks you can fix the errors before they become a problem. Go to pythonpodcast.com/tidydata today and get started for free with no credit card required. * Your host as usual is Tobias Macey and today I’m interviewing Luke Hoban about building and maintaining infrastructure as code with Pulumi
Interview * Introductions * How did you get introduced to Python? * Can you start by describing the concept of "infrastructure as code"? * What is Pulumi and what is the story behind it? + Where does the name come from? + How does Pulumi compare to other infrastructure as code frameworks, such as Terraform? * What are some of the common challenges in managing infrastructure as code? + How does use of a full programming language help in addressing those challenges? + What are some of the dangers of using a full language to manage infrastructure? - How does Pulumi work to avoid those dangers? * Why is maintaining a record of the provisioned state of your infrastructure necessary, as opposed to relying on the state contained by the infrastructure provider? + What are some of the design principles and constraints that developers should be considering as they architect their infrastructure with Pulumi? * Can you describe how Pulumi is implemented? + How does Pulumi manage support for multiple languages while maintaining feature parity across them? + How do you manage testing and validation of the different providers? * The strength of any tool is largely measured in the ecosystem that exists around it, which is one of the reasons that Terraform has been so successful. How are you approaching the problem of bootstrapping the community and prioritizing platform support? * Can you talk through the workflow of working with Pulumi to build and maintain a proper infrastructure? * What are some of the ways to approach testing of infrastructure code? + What does the CI/CD lifecycle for infrastructure look like? * What are the limitations of infrastructure as code? + How do configuration management tools fit with frameworks such as Pulumi? * The core framework of Pulumi is open source, and your business model is focused around a managed platform for tracking state. How are you approaching governance of the project to ensure its continued viability and growth? * What are some of the most interesting, innovative, or unexpected design patterns that you have seen your users include in their infrastructure projects? * When is Pulumi the wrong choice? * What do you have planned for the future of Pulumi?
Keep In Touch * LinkedIn * lukehoban on GitHub * @lukehoban on Twitter
Picks * Tobias + Bookshelf App * Luke + GoBinaries.com
Closing Announcements * Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management. * Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. * If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com) with your story. * To help other people find the show please leave a review on iTunes and tell your friends and co-workers * Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
Links * Pulumi * Terraform * IronPython * HCL == Hashicorp Config Language * Kubernetes * TypeScript * DevOps * CloudFormation * ARM == Azure Resource Manager * AWSx * GCP == Google Cloud Platform * Pulumi SaaS * SaltStack + Podcast Episode * Ansible * Elastic Beanstalk
The intro and outro music is from Requiem for a Fish The Freak Fandango Orchestra / CC BY-SA
Summary Python has become a major player in the machine learning industry, with a variety of widely used frameworks. In addition to the technical resources that make it easy to build powerful models, there is also a sizable library of educational resources to help you get up to speed. Sebastian Raschka’s contribution of the Python Machine Learning book has come to be widely regarded as one of the best references for newcomers to the field. In this episode he shares his experiences as an author, his views on why Python is the right language for building machine learning applications, and the insights that he has gained from teaching and contributing to the field.
Announcements * Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great. * When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, node balancers, a 40 Gbit/s public network, fast object storage, and a brand new managed Kubernetes platform, all controlled by a convenient API you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they’ve got dedicated CPU and GPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show! * Your host as usual is Tobias Macey and today I’m interviewing Sebastian Raschka about his experiences writing the popular Python Machine Learning book
Interview * Introductions * How did you get introduced to Python? * How did you get started in machine learning? + What were the concepts that you found most difficult in your career with statistics and machine learning? * One of your notable contributions to the field is your book "Python Machine Learning". What inspired you to write the initial version? + How did you approach the challenge of striking the right balance of depth, breadth, and accessibility for the content? + What was your process for determining which aspects of machine learning to include? * You have made 3 editions of the book from 2015 through December of 2019. In what ways has the book changed? + What are the biggest changes to the ecosystem and approaches to ML in that timeframe? * What are the fundamental challenges of developing machine learning projects that continue to present themselves? + What new difficulties have arisen with the introduction of new technologies and the rise of deep learning? * What are some of the ways that the Python language lends itself to analytical work? + What are its shortcomings and how has the community worked around them? + What do you see as the biggest risks to the popularity of Python in the data and analytics space? * What are some of the common pitfalls that your readers and students face while learning about different aspects of machine learning? * What are some of the industries that can benefit most from applications of machine learning? * What are you most excited about in the applications or capabilities of machine learning? + What are you most worried about?
Keep In Touch * Website * @rasbt on Twitter * rasbt on GitHub * LinkedIn
Picks * Tobias + Trolls World Tour * Sebastian + FFMPeg Normalize
Closing Announcements * Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management. * Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. * If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com) with your story. * To help other people find the show please leave a review on iTunes and tell your friends and co-workers * Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
Links * Python Machine Learning (Packt) + Buy On Amazon (affiliate link) * UW Madison * Pascal * Delphi * R * Perl * Bioinformatics + Seq - Podcast Episode + BioPython - Podcast Episode * CodeCademy * Udacity CS101 * Andrew Ng * Coursera * Support-Vector Machine * Bayesian Statistics * Matlab * scikit-learn * NumPy * Pandas + Podcast Episode * Sebastian’s Blog * Perceptron * Heatmaps In R * The Hundred Page Machine Learning Book by Andriy Burkov * ImageNet * Random Forest * Logistic Regression * XGBoost * Theano * Generative Adversarial Networks * Is This Person Real / This Person Does Not Exist * Reinforcement Learning * AlphaGo * AlphaStar * Ray * RLlib * Open AI * Google DeepMind * Google Colab * CUDA * Julia * Sebastian Raschka, Joshua Patterson, and Corey Nolet (2020). Machine Learning in Python: Main developments and technology trends in data science, machine learning, and artificial intelligence. Information 2020, 11, 193 * Swift Language * Swift for TensorFlow * Matplotlib * Differential Privacy * PrivacyNet * YouTube recordings of Stat453: Introduction to Deep Learning and Generative Models (Spring 2020) * ffmpeg-normalize
The intro and outro music is from Requiem for a Fish The Freak Fandango Orchestra / CC BY-SA
Summary Python has an embarrasment of riches when it comes to web frameworks, each with their own particular strengths. FastAPI is a new entrant that has been quickly gaining popularity as a performant and easy to use toolchain for building RESTful web services. In this episode Sebastián Ramirez shares the story of the frustrations that led him to create a new framework, how he put in the extra effort to make the developer experience as smooth and painless as possible, and how he embraces extensability with lightweight dependency injection and a straightforward plugin interface. If you are starting a new web application today then FastAPI should be at the top of your list.
Announcements * Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great. * When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, node balancers, a 40 Gbit/s public network, fast object storage, and a brand new managed Kubernetes platform, all controlled by a convenient API you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they’ve got dedicated CPU and GPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show! * Your host as usual is Tobias Macey and today I’m interviewing Sebastián Ramirez about FastAPI, a framework for building production ready APIs in Python 3
Interview * Introductions * How did you get introduced to Python? * Can you start by describing what FastAPI is? + What are the main frustrations that you ran into with other frameworks that motivated you to create an entirely new one? * What are some of the main use cases that FastAPI is designed for? * Many web frameworks focus on managing the end-to-end functionality of a website, including the UI. Why did you focus on just API capabilities? + What are the benefits of building an API only framework? + If you wanted to integrate a presentation layer, what would be involved in that effort? * What API formats does FastAPI support? + What would be involved in adding support for additional specifications such as GraphQL or JSON-LD? * There are a huge number of web frameworks available just in the Python ecosystem. How does FastAPI fit into that landscape and why might someone choose it over the other options? * Can you share your design philosophy for the project? + What are your main sources of inspiration for the framework? + You have also built the Typer CLI library which you refer to as the little sibling of FastAPI. How have your experiences building these two projects influenced their counterpart’s evolution? * What are the benefits of incorporating type annotations into a web framework and in what ways do they manifest in its functionality? * What is the workflow for a developer building a complex application in FastAPI? * Can you describe how FastAPI itself is architected and how its design has evolved since you first began working on it? + What are the extension points that are available for someone to build plugins for FastAPI? * What are some of the challenges that you have faced in building an async framework that is leveraging the new ASGI specification? * What are some sharp edges that users should keep an eye out for? * What are some unique or underutilized features of FastAPI that users might not be aware of? * What are some of the most interesting, unexpected, or innovative ways that you have seen FastAPI used? * When is FastAPI the wrong choice? * What are some of the most interesting, unexpected, or challenging lessons that you have learned in the process of building and maintaining FastAPI? * What do you have planned for the future of the project?
Keep In Touch @tiangolo on Twitter. @tiangolo on GitHub.
Picks * Tobias + Once Upon A Time TV Show * Sebastián + Cloud Atlas Movie + Isaac Asimov’s robot short stories + Python devtools debug function + async compatible requests with HTTPX + RescueTime for automatic time tracking + Joplin for Notes
Closing Announcements * Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management. * Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. * If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com) with your story. * To help other people find the show please leave a review on iTunes and tell your friends and co-workers * Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
Links * FastAPI * Typer * Typer CLI * FastAPI Alternatives, Inspiration and Comparisons * Explosion’s spaCy * Explosion’s Prodigy * Starlette * Pydantic * Uvicorn * Hypercorn * fastapi-utils + Class Based Views * GrahQL Ariadne * Coronavirus Tracker API * Terminals from browser: termpair * XPublish * Uber’s Ludwig * Netflix Dispatch * Colombia * Berlin Germany * Explosion AI * Python Type Annotations * Django Rest Framework * Flask * Swagger/OpenAPI * Sanic * NodeJS * JSON Schema * OAuth2 * Swagger UI * ReDoc * React * VueJS * Angular * REST == REpresentational State Transfer * JSON-LD * Go Language * Hug API framework * Click CLI Framework * Flask Blueprints * Tom Christie + Podcast Interview * Dependency Injection * ASGI + Podcast Episode * WSGI * Thread Local Variables * Context Vars * OAUTH2 Scopes * PipX * XArray * JAM Stack * NextJS * Hugo * GatsbyJS * FastAPI Project Templates
The intro and outro music is from Requiem for a Fish The Freak Fandango Orchestra / CC BY-SA
Summary Distributed computing is a powerful tool for increasing the speed and performance of your applications, but it is also a complex and difficult undertaking. While performing research for his PhD, Robert Nishihara ran up against this reality. Rather than cobbling together another single purpose system, he built what ultimately became Ray to make scaling Python projects to multiple cores and across machines easy. In this episode he explains how Ray allows you to scale your code easily, how to use it in your own projects, and his ambitions to power the next wave of distributed systems at Anyscale. If you are running into scaling limitations in your Python projects for machine learning, scientific computing, or anything else, then give this a listen and then try it out!
Announcements * Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great. * When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, node balancers, a 40 Gbit/s public network, fast object storage, and a brand new managed Kubernetes platform, all controlled by a convenient API you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they’ve got dedicated CPU and GPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show! * Your host as usual is Tobias Macey and today I’m interviewing Robert Nishihara about Ray, a framework for building and running distributed applications and machine learning
Interview * Introductions * How did you get introduced to Python? * Can you start by describing what Ray is and how the project got started? + How did the environment of the RISE lab factor into the early design and development of Ray? * What are some of the main use cases that you were initially targeting with Ray? + Now that it has been publicly available for some time, what are some of the ways that it is being used which you didn’t originally anticipate? * What are the limitations for the types of workloads that can be run with Ray, or any edge cases that developers should be aware of? * For someone who is building on top of ray, what is involved in either converting an existing application to take advantage of Ray’s parallelism, or creating a greenfield project with it? * Can you describe how Ray itself is implemented and how it has evolved since you first began working on it? * How does the clustering and task distriubtion mechanism in Ray work? * How does the increased parallelism that Ray offers help with machine learning workloads? + Are there any types of ML/AI that are easier to do in this context? * What are some of the additional layers or libraries that have been built on top of the functionality of Ray? * What are some of the most interesting, challenging, or complex aspects of building and maintaining Ray? * You and your co-founders recently announced the formation of Anyscale to support the future development of Ray. What is your business model and how are you approaching the governance of Ray and its ecosystem? * What are some of the most interesting or unexpected projects that you have seen built with Ray? * What are some cases where Ray is the wrong choice? * What do you have planned for the future of Ray and Anyscale?
Keep In Touch * Website * @robertnishihara on Twitter * robertnishihara on GitHub
Picks * Tobias + D&D Castle Ravenloft board game + One Deck Dungeon * Robert + The Everything Store
Closing Announcements * Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management. * Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. * If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com) with your story. * To help other people find the show please leave a review on iTunes and tell your friends and co-workers * Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
Links * Ray * Anyscale * UC Berkeley * RISELab * MATLAB * Deep Learning * Theano * Tensorflow * PyTorch + Podcast Episode * Philip Moritz * Reinforcement Learning * Hyperparameter Tuning * IPython Parallel * AMPLab * Apache Spark + Data Engineering Podcast Episode * Actor Model * Horovod(?) * Flink + Data Engineering Podcast Episode * Spark Streaming * Dask + Data Engineering Podcast Episode * gRPC * Tune * Rust * C++ * C * Apache Arrow * Wes McKinney + Podcast Interview * DataBricks * MongoDB * Elastic + Data Engineering Podcast Episode * Confluent * Embarassingly Parallel * Ant Financial * Flame Graph
The intro and outro music is from Requiem for a Fish The Freak Fandango Orchestra / CC BY-SA
Summary Bioinformatics is a complex and computationally demanding domain. The intuitive syntax of Python and extensive set of libraries make it a great language for bioinformatics projects, but it is hampered by the need for computational efficiency. Ariya Shajii created the Seq language to bridge the divide between the performance of languages like C and C++ and the ecosystem of Python with built-in support for commonly used genomics algorithms. In this episode he describes his motivation for creating a new language, how it is implemented, and how it is being used in the life sciences. If you are interested in experimenting with sequencing data then give this a listen and then give Seq a try!
Announcements * Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great. * When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, node balancers, a 40 Gbit/s public network, fast object storage, and a brand new managed Kubernetes platform, all controlled by a convenient API you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they’ve got dedicated CPU and GPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show! * You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on great conferences. And now, the events are coming to you, with no travel necessary! We have partnered with organizations such as ODSC, and Data Council. Upcoming events include the Observe 20/20 virtual conference on April 6th and ODSC East which has also gone virtual starting April 16th. Go to pythonpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today. * Your host as usual is Tobias Macey and today I’m interviewing Ariya Shajii about Seq, a programming language built for bioinformatics and inspired by Python
Interview * Introductions * How did you get introduced to Python? * Can you start by describing what Seq is and your motivation for creating it? + What was lacking in other languages or libraries for your use case that is made easier by creating a custom language? + If someone is already working in Python, possibly using BioPython, what might motivate them to consider migrating their work to Seq? * Can you give an impression of the scope and nature of the tasks or projects that a biologist or geneticist might build with Seq? * What was your process for identifying and prioritizing features and algorithms that would be beneficial to the target audience? * For someone using Seq can you describe their workflow and how it might differ from performing the same task in Python? * How is Seq implemented? + What are some of the features that are included to simplify the work of bioinformatics? + What was your process of designing the language and runtime? + How has the scope or direction of the project evolved since it was first conceived? * What impact do you anticipate Seq having on the domain of bioinformatics and genomics? * What have you found to be the most interesting, unexpected, and/or challenging aspects of building a language for this problem domain? * What is in store for the future of Seq?
Keep In Touch * arshajii on GitHub * Website
Picks * Tobias + Board Games + Labyrinth Boardgame + Board Game Geek * Ariya + Breakthrough documentary
Closing Announcements * Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management. * Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. * If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com) with your story. * To help other people find the show please leave a review on iTunes and tell your friends and co-workers * Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
Links * Seq * MIT CSAIL * Bioinformatics * LLVM * Intermediate Representation * MatLab * Moore’s Law * BioPython * Smith Waterman Algorithm * Hamming Distance * Pattern Matching in Functional Programming * SIMD == Single Instruction Multiple Data * Computational Genomics * Phylogenetics * Sequence Read Archive public data set * Google Cloud Life Sciences
The intro and outro music is from Requiem for a Fish The Freak Fandango Orchestra / CC BY-SA
Summary The state of the art in natural language processing is a constantly moving target. With the rise of deep learning, previously cutting edge techniques have given way to robust language models. Through it all the team at Explosion AI have built a strong presence with the trifecta of SpaCy, Thinc, and Prodigy to support fast and flexible data labeling to feed deep learning models and performant and scalable text processing. In this episode founder and open source author Matthew Honnibal shares his experience growing a business around cutting edge open source libraries for the machine learning developent process.
Announcements * Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great. * When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, node balancers, a 40 Gbit/s public network, fast object storage, and a brand new managed Kubernetes platform, all controlled by a convenient API you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they’ve got dedicated CPU and GPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show! * You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on great conferences. And now, the events are coming to you, with no travel necessary! We have partnered with organizations such as ODSC, and Data Council. Upcoming events include the Observe 20/20 virtual conference on April 6th and ODSC East which has also gone virtual starting April 16th. Go to pythonpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today. * Your host as usual is Tobias Macey and today I’m interviewing Matthew Honnibal about the Thinc and Prodigy tools and an update on SpaCy
Interview * Introductions * How did you get introduced to Python? * Can you start by giving an overview of your mission at Explosion? * We spoke previously about your work on SpaCy. What has changed in the past 3 1/2 years? + How have recent innovations in language models such as BERT and GPT-2 influenced the direction or implementation of the project? * When I last looked SpaCy only supported English and German, but you have added several new languages. What are the most challenging aspects of building the additional models? + What would be required for supporting symbolic or right-to-left languages? * How has the ecosystem for language processing in Python shifted or evolved since you first introduced SpaCy? * Another project that you have released is Prodigy to support labelling of datasets. Can you talk through the motivation for creating it and describe the workflow for someone using it? + What was lacking in the other annotation tools that you have worked with that you are trying to solve for in Prodigy? * What are some of the most challenging or problematic aspects of labelling data sets for use in machine learning projects? + What is a typical scale of data that can be reasonably handled by an individual or small team working with Prodigy? - At what point do you find that it makes sense to use a labeling service rather than generating the labels yourself? * Your most recent project is Thinc for building and using deep learning models. What was the motivation for creating it and what problem does it solve in the ecosystem? + How does its design and usage compare to other deep learning frameworks such as PyTorch and Tensorflow? + How does it compare to projects such as Keras that abstract across those frameworks? * How do the SpaCy, Prodigy, and Thinc libraries work together? * What are some of the biggest challenges that you are facing in building open source tools to meet the needs of data scientists and machine learning engineers? * What are some of the most interesting or impressive projects that you have seen built with the tools your team is creating? * What do you have planned for the future of Explosion, SpaCy, Prodigy, and Thinc?
Keep In Touch * LinkedIn * @honnibal on Twitter * honnibal on GitHub
Picks * Tobias + Onward movie * Matthew + Coronavirus Preparedness + Ray
Closing Announcements * Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management. * Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. * If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com) with your story. * To help other people find the show please leave a review on iTunes and tell your friends and co-workers * Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
Links * Explosion AI * SpaCy + Podcast Episode * Thinc * Prodigy * Natural Language Processing * Perl * NLTK * GPU == Graphics Processing Unit * TPU == Tensor Processing Unit * Transfer Learning * Airflow * Luigi * Perceptron * PyTorch * Tensorflow * Functional Programming * MxNet * Keras * Cuda * C Language * Continuous Integration * Blackstone * Allen AI Institute * SciSpaCy * Holmes * Sense2Vec * FastAPI
The intro and outro music is from Requiem for a Fish The Freak Fandango Orchestra / CC BY-SA
Summary Running a successful business requires some method of organizing the information about all of the processes and activity that take place. Tryton is an open source, modular ERP framework that is built for the flexibility needed to fit your organization, rather than requiring you to model your workflows to match the software. In this episode core developers Nicolas Évrard and Cédric Krier are joined by avid user Jonathan Levy to discuss the history of the project, how it is being used, and the myriad ways that you can adapt it to suit your needs. If you are struggling to keep a consistent view of your business and ensure that all of the necessary workflows are being observed then listen now and give Tryton a try.
Announcements * Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great. * When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, node balancers, a 40 Gbit/s public network, fast object storage, and a brand new managed Kubernetes platform, all controlled by a convenient API you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they’ve got dedicated CPU and GPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show! * You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Go to pythonpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today. * Your host as usual is Tobias Macey and today I’m interviewing Nicolas Évrard, Cédric Krier, and Jonathan Levy about Tryton
Interview * Introductions * How did you get introduced to Python? * Can you start by describing what Tryton is and how it got started? * What kinds of businesses is Tryton most suited to? + What kinds of businesses is Tryton not a good fit for? * Within a business, who are the primary users of Tryton? * Can you talk through a typical workflow for interacting with Tryton? * What are some of the most complex or challenging aspects of modeling a business while maintaining a high degree of customizability? * Can you describe how Tryton is architected and how its design has evolved since it was first started? + If you were to start over today, what would you do differently? * There are a number of plugins for Tryton. What kinds of functionality can be customized using the available interfaces? + What is the process for building a custom module for Tryton? * How do you manage sustainability of the Tryton project? * Given the criticality of the Tryton platform, how do you approach ongoing stability and security of the project? * What is involved in deploying and maintaining an installation of Tryton? * What are some of the most interesting, innovative, or unexpected ways that you have seen Tryton used? * What is in store for the future of Tryton?
Keep In Touch * Nicolas + nicoe on GitHub + @nicoe on Twitter * Cédric + @cedrickrier on Twitter + cedk on GitHub * Jonathan + LinkedIn
Picks * Tobias + Audio Books - Audible free trial (Affiliate Link) - Overdrive – ebooks and audiobooks from your local library - Public Domain Audiobooks * Nicolas + Civilization VI + FreeCiv + The 3 Body Problem * Cédric + Valérian and Laureline * Jonathan + Roil.com
Links * Tryton * B2CK * Tryton Foundation * Advocate Consulting Legal Group * Scheme * Lisp * Belgium * EuroPython Conference * Plone * Zope * VBA (Visual Basic for Applications) * Django * Odoo * ERP == Enterprise Resource Planning * Small/Medium Enterprise (SME) * GTK (Gnome ToolKit) * 3-Tier Application * Cookiecutter * Tryton Module Cookiecutter * Tryton Repository * Docker * GNU Health * Nereid
The intro and outro music is from Requiem for a Fish The Freak Fandango Orchestra / CC BY-SA
Summary One of the driving factors of Python’s success is the ability for developers to integrate with performant languages such as C and C++. The challenge is that the interface for those extensions is specific to the main implementation of the language. This contributes to difficulties in building alternative runtimes that can support important packages such as NumPy. To address this situation a team of developers are working to create the hpy project, a new interface for extension developers that is standardized and provides a uniform target for multiple runtimes. In this episode Antonio Cuni discusses the motivations for creating hpy, how it benefits the whole ecosystem, and ways to contribute to the effort. This is an exciting development that has the potential to unlock a new wave of innovation in the ways that you can run your Python code.
Announcements * Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great. * When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, node balancers, a 40 Gbit/s public network, fast object storage, and a brand new managed Kubernetes platform, all controlled by a convenient API you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they’ve got dedicated CPU and GPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show! * As a developer, maintaining a state of flow is key to your productivity. Don’t let something as simple as the wrong function ruin your day. Kite is the smartest completions engine available for Python, featuring a machine learning model trained by the brightest stars of GitHub. Featuring ranked suggestions sorted by relevance, offering up to full lines of code, and a programming copilot that offers up the documentation you need right when you need it. Get Kite for free today at getkite.com with integrations for top editors, including Atom, VS Code, PyCharm, Spyder, Vim, and Sublime. * You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to pythonpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today. * Your host as usual is Tobias Macey and today I’m interviewing Antonio Cuni about hpy, a project aiming to reimagine the C API for Python
Interview * Introductions * How did you get introduced to Python? * Can you start by describing what the hpy project is and how it got started? + What are the goals for the project? + Who else is involved? * How much engagement have you had with CPython core contributors or the steering council? * Who are the consumers of the current C API for the CPython implementation? + What are some of the pain points or shortcomings for those consumers? + What impact does that have for users of a given library that leverages C extensions? * Can you talk through the structure of the hpy project? + What are some of the design challenges that you are facing for determining the external API? + What is involved in integrating the hpy interface into alternate runtimes such as PyPy or RustPython? * What is the potential or observed performance impact for libraries that currently rely on the existing C API? * How has the vision and scope of this project been updated as you have gotten further along in the implementation? * What are the downstream impacts that you anticipate in projects such as PyPy and Cython? * What have you found to be the most challenging or contentious aspects of implementing hpy so far? * What are some of the most interesting/unexpected/useful lessons that you have learned while working on hpy? * What do you have planned for the near to medium term for hpy?
Keep In Touch * antocuni on GitHub * Website * @antocuni on Twitter
Picks * Tobias + Poetry * Antonio + Collapse: How Societies Choose To Fail Or Succeed by Jared Diamond
Links * hpy * PyPy * Alex Martelli + Podcast Interview * Python C Extensions * EuroPython * Victor Stinner * Cython + Podcast Episode * Armin Rigo * NumPy * ultrajson * GIL == Global Interpreter Lock * RustPython + Podcast Episode * GraalPython * hpy-rust
The intro and outro music is from Requiem for a Fish The Freak Fandango Orchestra / CC BY-SA
Summary Quantum computers promise the ability to execute calculations at speeds several orders of magnitude faster than what we are used to. Machine learning and artificial intelligence algorithms require fast computation to churn through complex data sets. At Xanadu AI they are building libraries to bring these two worlds together. In this episode Josh Izaac shares his work on the Strawberry Fields and Penny Lane projects that provide both high and low level interfaces to quantum hardware for machine learning and deep neural networks. If you are itching to get your hands on the coolest combination of technologies, then listen now and then try it out for yourself.
Announcements * Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great. * When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, node balancers, a 40 Gbit/s public network, fast object storage, and a brand new managed Kubernetes platform, all controlled by a convenient API you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they’ve got dedicated CPU and GPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show! * As a developer, maintaining a state of flow is key to your productivity. Don’t let something as simple as the wrong function ruin your day. Kite is the smartest completions engine available for Python, featuring a machine learning model trained by the brightest stars of GitHub. Featuring ranked suggestions sorted by relevance, offering up to full lines of code, and a programming copilot that offers up the documentation you need right when you need it. Get Kite for free today at getkite.com with integrations for top editors, including Atom, VS Code, PyCharm, Spyder, Vim, and Sublime. * You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to pythonpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today. * Your host as usual is Tobias Macey and today I’m interviewing Josh Izaac about how the work that he is doing at Xanadu AI to make it easier to build applications for quantum processors
Interview * Introductions * How did you get introduced to Python? * Can you start by describing what you are working on at Xanadu AI? + How do the specifics of your quantum hardware influence the way in which developers need to build their algorithms? (e.g. as compared to DWave) * What are some of the underlying principles that developers need to understand in order to take full advantage of the capabilities provided by quantum processors? * Can you outline the different components and libraries that you are building to simplify the work of building machine learning/AI projects for quantum processors? + What’s the story behind all of the Beatles references? + How do the different libraries fit together? * What are some of the workloads and use cases that you and your customers are focused on? * What are some of the most challenging aspects of designing a library that is accessible to developers while being able to take advantage of the underlying hardware? * How does the workflow for machine learning on quantum computers differ from what is being done in classical environments? + Given the magnitude of computational power and data processing that can be achieved in a quantum processor it seems that there is a potential for small bugs to have disproportionately large impacts. How can developers identify and mitigate potential sources of error in their algorithms? * For someone who is building an application or algorithm to be executed on a Xanadu processor, what does their workflow look like? + What are some of the common errors or misconceptions that you have seen in customer code? * Can you describe the design and implementation of the Penny Lane and Strawberry Fields libraries and how they have evolved since you first began working on them? * What are some of the most ambitious or exciting use cases for quantum systems that you have seen? * How are you using the computational capabilities of your platform to feed back into the research and design of successive generations of hardware? * What are some useful heuristics for determining whether it is worthwhile to build for a quantum processor rather than leveraging classical hardware? * What are some of the most interesting/unexpected/useful lessons that you have learned while working on quantum algorithms and the libraries to support them? * What is in store for the future of the Xanadu software ecosystem? * What are your predictions for the near to medium term of quantum computing?
Keep In Touch * josh146 on GitHub * Website * LinkedIn
Picks * Tobias + Knives Out movie * Josh + Baking Sourdough Bread
Closing Announcements * Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management. * Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. * If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com) with your story. * To help other people find the show please leave a review on iTunes and tell your friends and co-workers * Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
Links * Xanadu AI * Strawberry Fields * PennyLane * Quantum Physics * ASIC == Application Specific Integrated Circuit * FPGA == Field Programmable Gate Array * GPU == Graphics Processing Unit * Quantum Photonics * Qubit * Trapped Ions * Quantum Optics * Coherent Light * Heisenberg’s Uncertainty Principle * Wave/Particle Duality * Continuous Variable Quantum Computation * NetworkX * Tensorflow * The Walrus * Rigetti Computing * PyTorch + Podcast Episode * The Walrus Operator (Assignment Expressions) * Fortran * NumPy * SciPy * IPython + Podcast Episode * Jax * Quantum Machine Learning * Xanadu User Discussion Forum
The intro and outro music is from Requiem for a Fish The Freak Fandango Orchestra / CC BY-SA
Summary Most long-running programs have a need for executing periodic tasks. APScheduler is a mature and open source library that provides all of the features that you need in a task scheduler. In this episode the author, Alex Grönholm, explains how it works, why he created it, and how you can use it in your own applications. He also digs into his plans for the next major release and the forces that are shaping the improved feature set. Spare yourself the pain of triggering events at just the right time and let APScheduler do it for you.
Announcements * Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great. * When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, node balancers, a 40 Gbit/s public network, and a brand new managed Kubernetes platform, all controlled by a convenient API you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they’ve got dedicated CPU and GPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show! * You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to pythonpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today. * Your host as usual is Tobias Macey and today I’m interviewing Alex Grönholm about APScheduler, a library for scheduling tasks in your Python projects
Interview * Introductions * How did you get introduced to Python? * Can you start by describing what APScheduler is and the main use cases that APScheduler is designed for? + What was your movitvation for creating it? * What is the workflow for integrating APScheduler into an application? + In the documentation it says not to run more than one instance of the scheduler, what are some strategies for scaling schedulers? * What are some common architectures for applications that take advantage of APScheduler? + What are some potential pitfalls that developers should be aware of? * Can you describe how APScheduler is implemented and how its design has evolved since you first began working on it? + What have you found to be the most complex or challenging aspects of building or using a scheduling framework? * What are some of the most interesting/innovative/unexpected ways that you have seen APScheduler used? * What are some of the features or capabilities that you have consciously left out? + What design strategies or features of APScheduler are often overlooked or underappreciated? * What are some of the most useful or interesting lessons that you have learned while building and maintaining APScheduler? * When is APScheduler the wrong choice for managing task execution? * What do you have planned for the future of the project?
Keep In Touch * agronholm on GitHub
Picks * Tobias + The Data Exchange Podcast * Alex + Tenacity
Links * APScheduler * PHP * Java * ECMAScript * Celery * ERP == Enterprise Resource Planning * Cron Daemon * RPyC * Zookeeper + Data Engineering Podcast Episode * RethinkDB * Daylight Saving Time * Falsehoods Programmers Believe About Time * PyTZ * Celery Beats * Asphalt Framework + Podcast Episode * AnyIO * Twisted + Podcast Episode * Py2EXE * PyInstaller
The intro and outro music is from Requiem for a Fish The Freak Fandango Orchestra / CC BY-SA