Skip to the end for the TLDR.
And again, not selling anything even though I think everything here is technically free.
Motivation
Why do we need to build a codebase to do machine learning? For the engineers, this is obvious but for those who don’t build software for a living, it’s a reasonable question. The short answer is to keep complex, repeatable tasks organized for ongoing development. From data ingestion to live prediction, machine learning work should be done entirely in code. It’s complicated stuff that requires incredible attention to detail while managing many moving parts. Throw a few people at the same project working asynchronously without an organized codebase and you’ll be drowning in tech debt in no time. Individuals don’t do machine learning work; they develop the code that does machine learning work. So the question isn’t why do we need a codebase, but rather how do we build one that works for us?
Git repositories
In my experience, most data scientists are comfortable with git but don’t often use it effectively. They can clone repos and use the code in their projects but don’t always feel confident creating pull requests and contributing to broader efforts. If this is you, it’s time to build that confidence because this isn’t going to work otherwise. Engineers live in git, so if you can recruit one who will encourage your team to use it properly, do it. There are three primary options for hosting git repositories: GitHub, GitLab, and Bitbucket. They’re all functionally the same but offer different features and pricing models. All three have a free tier for a team of any size plus entry-level and premium paid options. The paid plans target enterprise customers with security and devops needs that almost certainly don’t matter to you now. Someone on your team is bound to have a strong opinion about this, so go with whatever keeps them from throwing a fit or put it to a vote. You can always change it later.
Once everyone is set up with an account, they need to install git locally and start building confidence with it. Git has more capabilities than I’ll ever care to know about, so it can be overwhelming. How you interact with it is also not very intuitive compared to other tools, which adds a layer of friction. I recommend understanding just the core concepts first. If you can explain them, and why they matter, to your mom, you’re good. Then start practicing with the command line. You can use git commands in your editor after you’ve built some confidence in the terminal, and future you will be very grateful you did this. Also, start with simple exercises that have nothing to do with your real projects. Push a program that prints “hello world”, then have someone change it to “hi world” and “hey world” through pull requests to make sure you can go through the motions. I like this setup tutorial for beginners that covers the core concepts with examples.
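To make that exercise concrete, here’s roughly what the motions look like in the terminal. The repo URL, branch name, and file name are placeholders, and the pull request itself gets opened in your git host’s web interface:

    # clone the practice repo (URL is a placeholder)
    git clone https://github.com/your-team/git-practice.git
    cd git-practice

    # create a branch for your change
    git checkout -b hi-world

    # edit hello.py so it prints "hi world", then stage and commit it
    git add hello.py
    git commit -m "Change greeting to hi world"

    # push the branch, then open a pull request on your git host
    git push -u origin hi-world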
Eventually, you’re going to screw something up that makes you want to nuke the repo and start over. I’ve personally pushed this red button many times but I have faith you can be better than me. Oh shit git is equally funny and useful for exactly these kinds of situations.
As you gain more experience with git as a team, you’ll eventually need to settle on a paradigm for using it. If you’re still new and building confidence, ignore this part and come back to it in a few months. But for those of you who are making this decision now, do not use git flow for machine learning, especially with a small team. It’s needlessly complex for your use cases and will quickly suck the fun out of machine learning with friends. If you need more convincing, read this. Just squash and merge instead.
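For what it’s worth, the squash usually happens with a button when you merge a pull request on GitHub, GitLab, or Bitbucket, but if you ever want the command-line equivalent, it looks something like this (branch name is hypothetical):

    # squash an entire feature branch into one commit on main
    git checkout main
    git pull
    git merge --squash feature-branch
    git commit -m "Add feature: one-line summary of the whole branch"
    git push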
Python and SQL
Now that you’re ready to build as a team, what languages are primarily going to fill out your codebase? Short answer: python and SQL. Can you do machine learning without python? Sure. Should you? There are edge cases, but almost certainly no. The overwhelming majority of data scientists and machine learning engineers use python for at least one critical aspect of their work. Do some googling if you don’t believe me. Of course, there are exceptions, like high-performance algorithm development that has to be done in a language like C, but then that code gets packaged up in a python library for general use. Python also currently appears to be the most commonly used programming language across all software development, not just machine learning, so there’s that.
Given my time in academia and policy research, I primarily used R and C++ until around 2018 when I fully transitioned to python. It was one of the best choices I’ve made in my career. Change is hard, but if you want to do machine learning, use python. Things are constantly changing though so I’m not willing to place any bets on the long term relevance of python or anything for that matter. But for now, it’s the industry standard. Don’t be contrarian about programming languages. Be a team player and follow best practices.
With that out of the way, the more important thing to address is the balance of python and SQL in machine learning development. Python is a fantastic general purpose language, so you can do all kinds of data processing with it, but remember: SQL stands for structured query language. It’s designed for optimizing data extraction and transformation. I constantly see people (I’m guilty of it too) run a basic select query to load raw tables into memory and then do all kinds of preprocessing gymnastics in pandas and numpy. This works, but then that nicely transformed data depends on python modules rather than just being available in a data warehouse. As a general rule, do as much in SQL as you can and make processed data available as views or tables in your warehouse. It might take longer to develop but it’s worth it in the long run.
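Here’s a hedged sketch of what I mean. The table and column names are made up and the exact syntax varies a bit by warehouse, but the point is that the filtering, joining, and aggregating live in the warehouse as a view rather than in a pandas script:

    -- hypothetical example: a cleaned, aggregated view the whole team can query
    CREATE VIEW customer_features AS
    SELECT
        c.customer_id,
        COUNT(o.order_id)  AS order_count,
        SUM(o.order_total) AS lifetime_value,
        MAX(o.order_date)  AS last_order_date
    FROM customers c
    LEFT JOIN orders o
        ON o.customer_id = c.customer_id
    WHERE c.is_test_account = FALSE
    GROUP BY c.customer_id;

The python side then shrinks to a one-line read of the view instead of a pile of dataframe transformations.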
Note: there is a lot of innovation in the space of data pipelining and things are changing constantly. Amazon announced new developments recently that move closer to their vision of a zero ETL future. Meanwhile, platforms like Databricks offer automated intelligent data pipelines that aim to make ETL easier and more robust. I’m all for ease and stability when it comes to data transformations but if a tool hides or obscures the dirty details of the real data, I don’t recommend it. That said, basically all of these things boil down to SQL so use it as much as you can in your codebase and stay open to new tools or services that make running it better.
Environments
If you missed Machine Learning with Friends Part 2, I strongly recommend not using Windows for development. That doesn’t mean you shouldn’t use it for other things, but if anyone on your team works on a Windows machine, have them install Linux using WSL2 (here’s a nice guide too). From here on, I’m assuming everyone is working in Linux or macOS, which is Unix-based.
As you do machine learning work, you’re going to use a variety of libraries and tools. The more libraries you have installed, the more likely it is that some of their dependencies will have conflicts. The pattern often looks like this: you want to help a teammate with something they built with a library like TensorFlow but when you install the required version, several dependencies conflict with libraries you need for another project. Rather than rebuild your base environment for every project, you can create distinct environments for different workflows to resolve the conflicts. You can’t just install everything you’ll ever need in one base environment. It would be great if that were possible, but that’s just not going to happen when the libraries you use are developed over time by different groups of people.
Many of you have Anaconda installed already and use it for managing environments. This is a great way to get started and I recommend it if you’re new to python. However, you can easily create and manage environments using plain old vanilla python. Store all of your dependencies in a requirements file; then it’s one line of code to create the environment, one to activate it, and one to install the requirements (sketched below). It makes what you’re doing crystal clear in a minimalist way that’s easy to store in a repo for others to use. The regular old python docs on virtual environments have all you need. Some of you will reach a point where constantly rebuilding virtual environments creates a lot of overhead. In this case, it’s time to use containers. Others have written extensively about this so here are two references I find helpful: Docker desktop and
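Circling back to the vanilla-python workflow, here’s a minimal sketch of those three lines on Linux or macOS. The environment name .venv is just a common convention:

    # requirements.txt in the repo lists your pinned dependencies, e.g. pandas==2.1.0
    python -m venv .venv              # create the environment
    source .venv/bin/activate         # activate it
    pip install -r requirements.txt   # install the requirements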
Frameworks
Now that you have a repo and virtual environments set up, it’s time to choose a framework in python for machine learning development. By framework, I mean the set of libraries you’ll use to work with data and train/test algorithms. The foundational libraries for working with data are pandas and numpy. Pandas allows you to easily read/write data and work with it in dataframes. It also has built-in data visualization capabilities through matplotlib, whose plotting interface was inspired by MATLAB. Personally, I prefer plotly for more complex visualization but you should use whatever you like. Numpy provides functionality for numerical operations like generating random numbers, sampling, etc. Pandas, numpy, and a visualization library are your home base for working with data in python. Here are some helpful references:
Pandas, Numpy, Python Cheatsheet
All you want to know about matplotlib
Getting started with plotly in python
Plotly through the eyes of Kaggle
Arguably, the most popular framework for general machine learning is scikit-learn. You’ll see both scikit-learn and sklearn often, but they’re essentially synonymous. Scikit-learn is easy to use and incredibly efficient since it’s built on numpy, scipy, and matplotlib. It’s a great framework to start with, and it stays useful for quick prototypes even if you adopt another framework for large-scale development/deployment. I strongly recommend that every machine learning team start by building out their codebase with pandas, numpy, scikit-learn, and a visualization library. If anyone on your team is new to data science or python, it’s a good idea to work through some of the modules in a structured course. My top recommendation is Jose Portilla on Udemy. He’s an excellent instructor with some of the best courses out there.
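To show how these pieces fit together, here’s a minimal prototype. The file, column names, and model choice are all made up for illustration, assuming a table with a binary label column:

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # load a table into a dataframe (path and columns are hypothetical,
    # e.g. exported from a warehouse view like the one above)
    df = pd.read_csv("customer_features.csv")
    X = df[["order_count", "lifetime_value"]]
    y = df["churned"]

    # hold out a test set, fit a model, and check accuracy
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    print(accuracy_score(y_test, model.predict(X_test)))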
Beyond the essential foundation, there are frameworks for deep learning and large scale neural networks. The two main frameworks are TensorFlow (developed by Google) and PyTorch (developed by Facebook, err Meta). If you’re working with images or natural language, you’ll need to invest heavily in one of them in addition to your scikit-learn base. It’s possible to do all of your machine learning in one of these, but it can be frustrating when you want to do something quickly or just for fun. TensorFlow and PyTorch are like big power tools, which is awesome, but sometimes all you need is a screwdriver so it’s good to have hand tools around.
I don’t have a strong opinion about using one of these over the other. If you’re not familiar with either and you want to get started, I recommend picking one that’s better suited for the types of problems you’re working on rather than what you think will be more popular. Working with tensors can be challenging at first if you’re mostly used to tabular data. It’s more important to use the framework that makes learning easier so you can progress faster. If you need to switch at some point, the transition won’t be that hard if you already know what you’re doing. With that said, here are some relevant resources comparing TensorFlow and PyTorch.
TLDR
Machine learning is complicated. It’s even more complicated when you do it as a team. To keep your friends as friends, build a solid codebase, keep it clean, use the same tools. This won’t be any fun if you can’t do cool stuff together.
Use Github, Gitlab, or Bitbucket for your repos. Doesn’t matter which one but pick one and start writing code collaboratively.
If you aren’t confident with git, practice with the command line before relying on a code editor to do it for you. Oh shit git and the setup tutorial for beginners will help.
Don’t use git flow; just squash and merge.
Use python and SQL for basically everything.
Python is a great general purpose programming language and the dominant language used in machine learning. If you want to use something else, do it at your own peril.
Do as much as possible with SQL. It’s optimized for data transformation and aggregation. This will also keep your datasets organized and available for everyone.
Use virtual environments or containers to manage dependencies.
Anaconda isn’t the worst starting point for getting comfortable with python, but switch to plain python virtual environments as soon as you have your sea legs. They’re dead simple and easy to share in the repo.
Use docker containers when your environments get big and they're a pain to rebuild over and over.
Build a base with pandas, numpy, scikit-learn, and your favorite visualization library. Add a deep learning framework if necessary.
Pandas allows you to easily work with data in dataframes. Numpy provides numerical operations like random number generation and sampling.
Data visualization is important but what you use is not. Matplotlib is standard issue but I prefer plotly.
If you want to work with images or natural language, invest some time in a deep learning framework like TensorFlow or PyTorch. These can be difficult to learn but worth the effort if you need deep learning to tackle your problem.
Up next is the final article in this series. I’ll summarize everything then describe a few areas of interest and how I would approach them if you want to do machine learning with friends.