Overview

PLPipes is a progressive framework in the sense that you are not forced to use all of it in every project. You can take advantage of some of the subsystems it offers and ignore the rest if that suits you better.

Subsystems

So, what are those subsystems?

Configuration: It handles the configuration of your project (e.g. database credentials, file paths, hyperparameters, neural network definitions, etc.) and it is quite powerful, supporting many features that help the developer keep everything tidy.

It is the only mandatory component as all the rest build on top of it and use it pervasively.
Database: It handles interaction with databases: connections, transactions, reading and writing tables, integration with data-frame frameworks such as pandas or polars, etc.
Actions: like scripts but better!

This layer provides support for running several types of tasks, for instance running some Python code, running a query in a database and saving the result as a new table, processing a Quarto document, etc.
Runner: It is a Python script that runs the actions, taking care of setting up the configuration, parsing command line arguments, etc.
Cloud: Several packages are offered for calling into common cloud APIs (Azure, Google Cloud, AWS, OpenAI, etc.).

Some of them provide basic functionality such as authentication; others offer more advanced features, such as a file-system layer that provides a unified interface for accessing any supported storage service.

The `PLPipes` mindset

Even though you can use some of those subsystems independently, in our (quite biased) opinion, it is better to use all of them together!

The big gain is that all of your projects are going to look the same. When somebody on your team starts working on an already running project, they will not have to ask how to configure access to the database, how to get the data, or how to generate that fancy monthly report, because it is going to be the same as, or very similar to, their previous PLPipes projects.

So, what does a typical PLPipes project look like?

The Actions

PLPipes projects are organized around actions which can be considered as atomic units of work. Examples of actions are downloading a file, transforming some data or training a model.

Actions are grouped in sequences to create data processing pipelines. Several pipelines can be defined inside one project, and it is even possible to change which actions form a pipeline dynamically depending on the deployment environment, the configuration, command line arguments, etc.

Even though the framework doesn't impose it, actions are usually organized in a similar fashion. For instance, a common set of actions for a simple project could be:

download
preprocess
train
evaluate
report

In a more complex project, those actions could become sequence actions that call other actions doing smaller tasks, but the overall structure remains much the same.
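The idea of grouping actions into sequences can be sketched in plain Python. This is an illustration only and does not use the PLPipes API: the `ACTIONS` registry, the `action` decorator and `run_sequence` are hypothetical names, the action bodies are placeholders, and a plain dictionary stands in for whatever the actions exchange.

```python
# Illustrative sketch only: every name here is hypothetical, not the PLPipes API.
from typing import Callable, Dict, List

# Registry mapping an action name to the function that implements it.
ACTIONS: Dict[str, Callable[[dict], None]] = {}


def action(name: str):
    """Register a function as a named action."""
    def register(func):
        ACTIONS[name] = func
        return func
    return register


@action("download")
def download(state: dict) -> None:
    state["raw"] = [3, 1, 2]  # placeholder for fetching real data


@action("preprocess")
def preprocess(state: dict) -> None:
    state["clean"] = sorted(state["raw"])


@action("report")
def report(state: dict) -> None:
    state["report"] = f"{len(state['clean'])} rows processed"


def run_sequence(names: List[str], state: dict) -> dict:
    """Run the named actions in order, sharing one state dictionary."""
    for name in names:
        ACTIONS[name](state)
    return state


# A pipeline is just an ordered list of action names, so it could be
# built from configuration or command line arguments.
PIPELINE = ["download", "preprocess", "report"]
result = run_sequence(PIPELINE, {})
```

Because the pipeline is only a list of names, swapping or skipping an action for a given environment amounts to building a different list.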

The Database

In the context of PLPipes, a central component is the relational database, which serves as a means of exchanging information between actions. While the file system and other means can be used alternatively, the database is the preferred choice, at least for tabular data.

With the default configuration, PLPipes creates a SQLite database in the local file system which is immediately ready for the developer, without any setup work or programming required on their side.

The framework also provides a rich set of functions for common tasks, such as executing queries and reading the data as data frames, appending data frames to tables, and synchronizing tables between databases.

Utilizing a database offers several advantages:

Effortless Data Inspection: You can easily inspect the data using your preferred database GUI client (SQLiteStudio, DBeaver, etc.), a Jupyter notebook, or simply the SQLite CLI. This allows you to explore the data, perform cross-referencing with other project data, and utilize SQL for queries.
Structured Data Design: Working with a database encourages thoughtful data modeling and design.
Schema Documentation: You can document the schema of your database, aiding in understanding and maintaining your data structure.
Clear Data Exchange: As you use the database to pass data between actions, you establish a well-defined interface, enhancing clarity and consistency.

This database-centric approach in PLPipes simplifies data management and empowers you to work efficiently with your project's data resources.

Finally, several local, remote and cloud databases are supported and can be configured.
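To make the idea of exchanging data through tables concrete, here is a minimal sketch that uses only Python's standard `sqlite3` module. It deliberately does not use the PLPipes database functions; the table names, the sample rows and the two action functions are invented for illustration.

```python
# Illustrative sketch using only the standard library; it is not the
# PLPipes database API, and all table and function names are hypothetical.
import sqlite3


def preprocess(conn: sqlite3.Connection) -> None:
    """First action: write cleaned rows into a table for later actions."""
    conn.execute("CREATE TABLE clean_sales (region TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO clean_sales VALUES (?, ?)",
        [("north", 10.0), ("south", 5.5), ("north", 2.5)],
    )
    conn.commit()


def summarize(conn: sqlite3.Connection) -> None:
    """Second action: read the previous table and store an aggregate."""
    conn.execute(
        "CREATE TABLE sales_by_region AS "
        "SELECT region, SUM(amount) AS total "
        "FROM clean_sales GROUP BY region"
    )
    conn.commit()


# An in-memory database keeps the sketch self-contained; passing a file
# path instead would persist the tables between runs.
conn = sqlite3.connect(":memory:")
preprocess(conn)
summarize(conn)

rows = conn.execute(
    "SELECT region, total FROM sales_by_region ORDER BY region"
).fetchall()
```

The two functions never call each other: the `clean_sales` table is the whole interface between them, which is what makes the intermediate data easy to inspect with any SQLite client.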

The Runner

The actions (or pipelines) are initiated by the runner, which is essentially a Python script that interfaces with plpipes. It has the capability to handle command-line arguments, configuration files, and environment variables in a unified manner.

A typical runner invocation appears as follows:

python3 bin/run.py train evaluate -s model_name=resnet3

Users can also create custom runners to leverage the framework in various environments, such as Azure FunctionApps, AWS Lambdas, Jupyter notebooks, Spark and more.
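As a rough picture of what a runner has to do with such a command line, the sketch below parses a list of action names plus `-s key=value` pairs using the standard `argparse` module. It is not the PLPipes runner: `parse_runner_args` is a hypothetical helper, and treating `-s` as "set a configuration entry" is an assumption made for this example.

```python
# Illustrative sketch only: not the PLPipes runner. The meaning given to
# "-s" (set a configuration entry) is an assumption for this example.
import argparse
from typing import Dict, List, Tuple


def parse_runner_args(argv: List[str]) -> Tuple[List[str], Dict[str, str]]:
    """Split a command line into action names and key=value overrides."""
    parser = argparse.ArgumentParser(description="Toy runner")
    parser.add_argument("actions", nargs="+", help="actions to run, in order")
    parser.add_argument(
        "-s", "--set", dest="settings", action="append", default=[],
        metavar="KEY=VALUE", help="override one configuration entry",
    )
    args = parser.parse_args(argv)

    overrides: Dict[str, str] = {}
    for item in args.settings:
        key, sep, value = item.partition("=")
        if not sep:
            parser.error(f"expected KEY=VALUE, got {item!r}")
        overrides[key] = value
    return args.actions, overrides


# Same arguments as the invocation shown above.
actions, overrides = parse_runner_args(
    ["train", "evaluate", "-s", "model_name=resnet3"]
)
```

A custom runner for another environment could reuse the same two outputs, a list of actions and a dictionary of overrides, while obtaining them from somewhere other than the command line.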

The Pervasive Configuration

All PLPipes subsystems rely on a central configuration, and each one expects to find its settings in a specific place within it. For example, configuring database connections follows the same pattern in every PLPipes project. Once you've configured one, you'll find the process familiar and applicable to all.

While certain parameters may vary (configuring a connection to a local file database differs from configuring one for a cloud-based server like Azure SQL) the fundamental approach remains consistent at a higher level.

This approach also simplifies resource tracking for individual projects, as all resources are clearly declared within the configuration files.
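The following sketch illustrates the general idea of one central configuration where only some parameters change between environments. It is not the PLPipes configuration format: the key names (`db.work.driver`, `train.epochs`), the driver labels and the helpers `deep_merge` and `lookup` are all hypothetical.

```python
# Illustrative sketch only: the keys, values and helpers are hypothetical
# and do not reflect the actual PLPipes configuration layout.
from typing import Any, Dict


def deep_merge(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict where entries in `override` win, merging nested dicts."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def lookup(config: Dict[str, Any], dotted_key: str) -> Any:
    """Fetch a nested entry using a dotted path such as 'db.work.driver'."""
    node: Any = config
    for part in dotted_key.split("."):
        node = node[part]
    return node


# Settings shared by every environment.
common = {"train": {"epochs": 10}}

# Each environment declares its own database under the same key.
dev = {"db": {"work": {"driver": "sqlite", "file": "work.sqlite"}}}
prod = {"db": {"work": {"driver": "azure_sql", "server": "example.invalid"}}}

cfg_dev = deep_merge(common, dev)
cfg_prod = deep_merge(common, prod)

dev_driver = lookup(cfg_dev, "db.work.driver")
prod_driver = lookup(cfg_prod, "db.work.driver")
```

Code that needs the database always asks for the same key; only the values stored under it differ between a local file and a cloud server.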

The Helper Modules

Finally, PLPipes aims to be a rich framework that can take care of any task related to a data scientist's work.

That is, for instance, why it offers a cloud module and a powerful package for accessing several cloud storage services: even though this is not core to a data scientist's work, in practice it is something we frequently need to do, and so it goes in!

In Summary

In summary, with PLPipes, instead of a bunch of scripts, each doing something different, we have a set of pipelines built on top of actions that use a relational database to store intermediate data, and a standardized Python script to get everything running.

It also provides many additional modules to make the data scientist's life much easier!