Motivation
My motivation for this series was to explain how to successfully do machine learning as a side project with friends. Success here means a small group building an end to end process that begins with data collection and ends with useful outputs. What’s useful is ultimately up to you and your friends, but in this final part of the series I offer a few ideas that might help you get started. I also put all of the TLDRs from parts 1-4 at the end so this can serve as a master guide to get up and running.
Machine learning is software that adapts based on real examples from data. How cool is that?! But also, how hard is that? You want to write software that can effectively rewrite itself to make better decisions based on the data it analyzes. I can barely do two of those things at once without falling down. Machine learning is complicated and requires a ton of different skills. Working on a team lets people shine where they’re strong while learning new things from others.
Tackling a new project with a lot of moving parts usually follows a painful pattern for me. I get excited, make a half-baked plan, and get started assuming I’ll sort out the details as I go. Then I spend loads of time figuring out basic stuff or reworking steps just because it’s new to me. Working with people who have diverse interests and experiences makes this process way more fun and effective. I started working on machine learning side projects four years ago, some solo and some with a team. I hope this summary of what I’ve learned is helpful and inspiring because machine learning is really fun when things actually work.
Lessons learned
In addition to all of the specifics laid out in parts 1-4 (summarized below), here are three important lessons I learned while doing machine learning in my free time.
Solo projects are easy to start but hard to finish. Building a team is hard, but it makes quitting even harder. About a year ago, I realized that every project I started by myself had fizzled out while the one side project I did with a small team for a friend was thriving. I wanted to work on machine learning problems in sports, so I put together a small group of like-minded people to give it a try. It took way longer than expected to get going, but now I don’t think I could kill the project if I tried.
Machine learning is mostly software development and data engineering. A deep understanding of statistics, algorithms, decision science, etc. can make you great at machine learning, but none of that matters if you don’t write good code. When I started, I understood these things well conceptually but could barely code my way out of a paper bag. I spent over a year mostly tackling simple software projects before I could confidently do real machine learning work.
Attempting to automate decision making reveals how complex making a decision can be. In an abstract sense, doing machine learning is writing instructions for a computer that will learn how to make decisions from examples. Actually going through this exercise reveals how much unconscious data processing we do as humans to make the simplest decisions. The process of engineering a decision is way more labor intensive than I anticipated.
Ideas for getting started
A critical aspect of machine learning is the need to process a continuous stream of data. It could be clicks on product recommendations or changing stock prices, but the goal is to process cases as they come in. You’re writing software that changes based on examples, not fixed rules defined by a single analysis. Training an algorithm to predict survivors on the Titanic can be good practice, but it’s not the real thing because there is no new data. It’s like throwing a football in your backyard compared to playing in a real game.
If you want to do machine learning with friends, work on something that consistently generates new data. It’s good to start with a static dataset, but when you’re ready to level up, you don’t want to wait around for another Titanic to sink (God forbid). A great starting point for exploring streams is the RapidAPI Hub. It has loads of free APIs that you can use to get started and pretty reasonable pricing for paid services. My examples below all have APIs available on the RapidAPI Hub, but if you get serious about any of these, you’ll likely find other sources.
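To make that concrete, here’s roughly what pulling from any API on the RapidAPI Hub looks like in python. The host, endpoint, and query parameter below are hypothetical placeholders; copy the real values from the API’s page on the hub.

```python
import os

import requests

# Every API on the RapidAPI Hub is called the same way: a plain HTTP
# request with two RapidAPI headers. Host, path, and params below are
# placeholders; copy the real ones from the API's page on the hub.
API_HOST = "example-sports-api.p.rapidapi.com"  # placeholder host
headers = {
    "X-RapidAPI-Key": os.environ["RAPIDAPI_KEY"],  # keep your key out of code
    "X-RapidAPI-Host": API_HOST,
}

response = requests.get(
    f"https://{API_HOST}/v1/games",     # placeholder endpoint
    headers=headers,
    params={"date": "2023-04-01"},      # placeholder query parameter
    timeout=10,
)
response.raise_for_status()
print(response.json())
```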
Note: the examples below are intended to offer ideas, motivation, and starting points with data sources. They’re not step-by-step instructions for making money doing these things, which should be obvious, but the internet is full of get-rich-quick clickbait so I have to say it.
Sports
Why I like it
I loved playing sports growing up and I love watching them now, but I love them even more for machine learning projects. The data is rich and easy to work with so a lot of people can jump in without a ton of prior knowledge. You can focus on a sport you know or learn a new one. You can watch games with your friends to test your predictions. If you get good enough and live in a state where sports betting is legal, you might make some money without needing to build a business or sell a product.
I also like this application because you can pick and choose classification and regression problems: will the Dodgers win today, how many passing yards will Patrick Mahomes throw for, will the total score in the Warriors game exceed 250, and so on. The challenge is in designing strong features and drawing boundaries for training sets. Professional sports are highly competitive and constantly changing, so you have to stay on top of things. This is what separates applied machine learning from basic science. The factors that influence diabetes diagnoses might shift slightly over decades, while the features that predict wins in the NFL can change significantly in just a few years. I like this because it pushes you to constantly improve your work.
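Here’s a minimal sketch of that classification/regression split using scikit-learn. Every feature and target here is synthetic, invented purely for illustration; in a real project these would be rolling team stats, rest days, injuries, and so on from your historical dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic game-level features standing in for a real historical dataset.
rng = np.random.default_rng(0)
games = pd.DataFrame({
    "off_rating_diff": rng.normal(0, 5, 500),
    "rest_day_diff": rng.integers(-3, 4, 500),
    "home": rng.integers(0, 2, 500),
})
games["home_win"] = (games["off_rating_diff"] + games["home"] + rng.normal(0, 5, 500)) > 0
games["total_points"] = 220 + 2 * games["off_rating_diff"] + rng.normal(0, 10, 500)

features = ["off_rating_diff", "rest_day_diff", "home"]
train, test = train_test_split(games, random_state=0)

# Classification: will the home team win?
clf = GradientBoostingClassifier().fit(train[features], train["home_win"])
print("home win probability:", clf.predict_proba(test[features][:1]))

# Regression: how many total points will be scored?
reg = GradientBoostingRegressor().fit(train[features], train["total_points"])
print("predicted total points:", reg.predict(test[features][:1]))
```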
Where to start
To get started with sports data, I recommend picking a sport you know and enjoy watching, then spending a few bucks on a quality historical dataset. Big Data Ball seems to be the best option based on my experience. The RapidAPI Hub also has several APIs for different sports with free options if you don’t want to pay anything to get started. Be warned, though, that if you google anything with the keywords “sports betting data” you’ll be drowning in garbage ads and suggested links for predatory products.
What to consider
The world of sports betting can be unforgiving and intimidating, so be prepared if you decide to jump in. First, don’t pay for someone’s picks. The best sports bettors don’t sell picks, they just bet their edge. Second, gambling on sports for fun is different from betting on sports with machine learning. I have an article about decision making and machine learning you should read if you want to give this an honest shot. You’ll need historical odds to test your models, and the Live Sports Odds API on RapidAPI is a credible place to start. Finally, be advised that if you get really good at sports betting, it will be difficult to sustain because sportsbooks are infamous for banning sharp bettors. If one of you gets banned, though, the rest of the team can keep making money before getting banned too, so the game can go on for a while. Also, how good would that look on a resume?
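One bit of plumbing you’ll need either way: converting the American odds you’ll get from an odds API into implied probabilities so you can compare them against your model. These are the standard formulas, sketched in python:

```python
def implied_probability(american_odds: int) -> float:
    """Win probability implied by American odds (includes the book's margin)."""
    if american_odds < 0:
        return -american_odds / (-american_odds + 100)
    return 100 / (american_odds + 100)

def expected_profit(p_model: float, american_odds: int, stake: float = 1.0) -> float:
    """Expected profit on a bet given your model's win probability."""
    if american_odds < 0:
        payout = stake * 100 / -american_odds
    else:
        payout = stake * american_odds / 100
    return p_model * payout - (1 - p_model) * stake

# A -150 favorite implies a 60% win probability. If your model says 65%,
# the bet has positive expected value (before variance and betting limits).
print(implied_probability(-150))    # 0.6
print(expected_profit(0.65, -150))  # ~0.083 per unit staked
```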
Trading
Why I like it
I like trading as a machine learning application partially because it’s polarizing. People either think they can make an easy buck trading on clever insights, or they think trying to beat the market is stupid. Whatever your beliefs, I’d encourage you to try proving yourself wrong with machine learning. I’ll probably catch some heat from my friends in finance for saying this, but “you can’t beat the market” is overstated and deters people from engaging with active investing. You don’t need to compete in every market. You can home in on whatever you like: blue chip stocks, crypto, options, bonds, etc., on whatever time scale you see fit.
If “beating the market” means investing billions of other people’s dollars to consistently outperform the major index funds, then I agree it’s next to impossible. But if it means investing modest amounts of your own money and placing a handful of well researched trades to beat the S&P 500 slightly, then I wouldn’t bet against you. The reality is that markets are messy and full of different players operating with varying abilities. If your primary goal is to work on some cool machine learning challenges with friends, trading offers exactly that plus a non-zero chance to make some money if you approach it thoughtfully.
Aside from challenging your assumptions, I have two main reasons for suggesting trading as a machine learning application. The first is that the data is clean, easy-to-understand time series. This makes it quick to start while taking a lifetime to master, which is a hallmark of any compelling game. The second is that it forces you to do rigorous backtesting. The feedback is punishingly fast and honest. If there’s leakage in your models or bugs in your code that make performance look better than it is, you’ll find out in the wild.
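Here’s a minimal sketch of the walk-forward structure that keeps leakage out, with a toy momentum rule on synthetic prices. The rule itself is meaningless and not a strategy recommendation; the point is that the rule only ever sees data strictly before the period it’s scored on.

```python
import numpy as np
import pandas as pd

# Walk-forward evaluation: at each step the rule only sees returns that
# occurred strictly before the period it is scored on. Fitting on the
# full history first is the classic leakage bug that inflates backtests.
rng = np.random.default_rng(1)
prices = pd.Series(np.cumsum(rng.normal(0, 1, 300)) + 100)
returns = prices.pct_change().dropna()

pnl = []
for t in range(100, len(returns)):
    history = returns.iloc[:t]                          # past data only
    signal = 1 if history.tail(20).mean() > 0 else -1   # toy momentum rule
    pnl.append(signal * returns.iloc[t])                # realized out of sample

print("mean per-period pnl:", np.mean(pnl))
```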
Where to start
The fastest and easiest way to get started is with this yfinance python tutorial. yfinance is a library that pulls historical daily prices from Yahoo Finance going back decades. It also offers higher-resolution data, but the history is limited. This is a great way to jump in and start building a basic prediction/backtesting framework. Once you get your feet under you, check out the RapidAPI finance section for a variety of APIs that might help your efforts. Advances in Financial Machine Learning is also a great resource. If/when you want to try some programmatic trading, Alpaca is one of the most trusted platforms. When you create an account and deposit funds, you also get access to data at no extra cost.
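For example, pulling decades of daily prices takes a couple of lines:

```python
import yfinance as yf

# Decades of daily bars for the S&P 500 ETF.
spy = yf.download("SPY", start="1995-01-01", auto_adjust=True)
print(spy.tail())

# Higher-resolution bars exist too, but only for a short trailing window.
intraday = yf.download("SPY", period="5d", interval="1m")
print(len(intraday))
```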
What to consider
Finally, I’ll encourage you to approach this with curiosity and to measure progress with how much you learn, not how much you earn. Ultimately, the more you learn the better chance you’ll have to make money but if you expect to make money, you’ll probably quit before you learn much. This area will naturally push you to explore new ideas and data sources to improve your predictions. For example, you might start using satellite images to predict retail traffic or weather forecasts to anticipate crop yields. The possibilities are effectively endless so it’s up to you how far you want to take it. Even if you don’t make a ton of money, taking this seriously will make you and your teammates more informed investors and better engineers.
Weather
Why I like it
Weather and climate are increasingly important for many applications ranging from energy consumption to travel disruptions and food prices. Working with weather and forecast data has a substantially higher barrier to entry compared to sports and trading, but the investment will be worth it for many teams. Having a firm handle on these data sources paired with an ability to make reasonable and timely predictions is incredibly valuable. For example, accurately predicting extreme weather in corn growing regions of the US could provide an edge in trading commodity futures. If you’re looking to develop machine learning skills and domain expertise that could lead to a successful business or job opportunity, this is a very promising area.
Where to start
To get started, I’d recommend predicting extreme outlier events in different locations like heat waves, torrential rains, etc. You could use this capability to trade commodities in the short run or anticipate changes in real estate prices in the long run. A changing climate will impact where people want to live and what they produce. If you’re ahead of the pack, you just might be able to make some life-changing investments. Some data sources to get started with are the OpenWeather API, the Dark Sky API, and the weather forecasts API. If you eventually want to go directly to the raw source data from the weather stations, take a look at NOAA. This will require more heavy lifting to make the data useful for machine learning applications, but it’s the best approach if you want to build a business around this eventually.
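As one simple starting point, here’s a pandas sketch that flags extreme heat by comparing each day against a percentile threshold for that time of year. The data and column names are synthetic placeholders; you’d plug in daily max temperatures from whichever API or NOAA export you settle on.

```python
import numpy as np
import pandas as pd

# Synthetic daily max temperatures standing in for a real station export.
rng = np.random.default_rng(2)
dates = pd.date_range("2000-01-01", "2020-12-31", freq="D")
temps = pd.DataFrame({
    "date": dates,
    "tmax": 15 + 10 * np.sin(2 * np.pi * dates.dayofyear / 365) + rng.normal(0, 3, len(dates)),
})

# Climatology: the 95th percentile of tmax for each day of the year.
temps["doy"] = temps["date"].dt.dayofyear
threshold = temps.groupby("doy")["tmax"].transform(lambda s: s.quantile(0.95))

# Heat wave candidate: three or more consecutive days above threshold.
hot = temps["tmax"] > threshold
temps["heat_wave"] = hot & hot.shift(1, fill_value=False) & hot.shift(2, fill_value=False)
print(temps.loc[temps["heat_wave"], ["date", "tmax"]].head())
```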
What to consider
Compared to sports and trading, machine learning for weather prediction is a long term play. It’s going to be harder to get started and harder to build a team but it could be far more valuable in the end. If you get good at this, you’ll naturally find your way to satellite images which is an incredibly exciting data source. Companies like Planet launch their own satellites and sell very high resolution images of the entire planet. Their data is expensive but I know they have programs for startups and nonprofits to get them started at steep discounts. There are also open access datasets you can start with for free but they tend to be lower resolution and frequency.
Final thoughts
Machine learning and AI have never been hotter as things like ChatGPT and Stable Diffusion dominate headlines. There’s loads of interest in this work and endless content about how to use these things to do specific tasks, but not nearly enough about how to really get your hands in the machine. I wrote this series as a way to open my garage door and show the makers of the data science and engineering world how to do this stuff end to end on your own, not just run someone else’s code. It may only be a matter of time before developing or tinkering with machine learning systems is possible exclusively in commercial settings. I hope my words motivate and inspire some of you to build a team and learn by doing.
TLDR from Parts 1-4
Part 1
Machine Learning with Friends, Part 1: Building a team and getting organized.
Here’s what to set up and how much it will cost per user. By the way, I’m assuming most of you won’t have more than 10 people in your crew.
Buy a domain from something like Namecheap for $10 or less per year.
Sign up for the Google Workspace Starter plan for $6/user/month to manage the organization and provide a catch-all for collaboration if you don’t opt for other tools.
Set up a free workspace in Slack and add users through the Google accounts you just created. The basic version is free for unlimited users. The pro version for $9/user/month is worth it for video calls with screen sharing if you do a lot of remote pair programming.
Set up an account with Atlassian, add your team through their Google accounts, then activate Jira and Confluence for task tracking and documentation. Both are free if you have 10 or fewer users; after that, Jira is $7.50/user/month and Confluence is $5.50/user/month.
If you have 10 or fewer users, you’re only out $6/month for each of them with this setup. If you opt for Slack Pro and your team grows beyond 10, you’ll pay about $28/month for each but then you’re probably on your way to a legit startup.
Part 2
Machine Learning with Friends, Part 2: Consistent operating systems and flexible compute.
Machine learning is a team sport and to do it effectively, you need both consistency and flexibility. With respect to computing, you need consistent operating systems so that different people can contribute to a broader effort on different devices. Additionally, you need the flexibility to scale hardware up or down depending on the task.
The TLDR:
Get everyone on unix/linux. If they use a Mac, all set. If they use Windows, install WSL2 and run linux on their PC.
Pick a cloud platform for on-demand compute and commit to an educational resource if you aren’t experienced with setting up accounts and managing users. If this is all new to you, choose AWS or GCP and buy a course from Udemy or Coursera to learn how to set things up properly. The learning curve is steep if you’re brand new to this stuff and mistakes can be expensive. It’s worth it though.
Set up your account with a budget and thoughtful permissions for your team. Don’t just make everyone an admin without a good reason.
Pick an option for on-demand virtual machines. If you’re new to this and using AWS, SageMaker Notebook instances are easy to get started with (see the sketch after this list). In GCP, try the Vertex AI Workbench. If someone on your team is more experienced and you don’t want to pay the premium on these services, I’ll encourage you to set up your own instances, but make sure they’re as easy for the rest of your team to use as they are for you.
Use cloud services as needed, don’t just do everything in the cloud by default.
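If you land on AWS, here’s a rough sketch of spinning up a SageMaker notebook instance with boto3 instead of clicking through the console. The instance name and role ARN are placeholders, and the IAM role needs to exist already with SageMaker permissions attached.

```python
import boto3

# Spin up a managed notebook instance. The name and role ARN are
# placeholders; the IAM role must already exist with SageMaker access.
sagemaker = boto3.client("sagemaker", region_name="us-east-1")

sagemaker.create_notebook_instance(
    NotebookInstanceName="ml-with-friends",  # hypothetical name
    InstanceType="ml.t3.medium",             # small and cheap to start
    RoleArn="arn:aws:iam::123456789012:role/SageMakerExecutionRole",  # placeholder
    VolumeSizeInGB=20,
)

# Stop it when you're done so you aren't billed for idle compute:
# sagemaker.stop_notebook_instance(NotebookInstanceName="ml-with-friends")
```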
It’s hard to say what to expect in terms of cost because your needs will vary depending on what you’re doing and how many people are doing it. With that said, if you use cloud services thoughtfully and only spend as needed, I wouldn’t be surprised if most of your monthly bills are in the $10-$50 range. When it’s just me using notebook instances a few times a week to run memory-intensive jobs for a few hours at a time, I don’t spend more than $10/month.
Part 3
Machine Learning with Friends, Part 3: the data plumbing needed to crawl, walk, and run.
Too long; didn’t read? Bummer, this is a good one. Oh well, enjoy doom scrolling with the 15 minutes you didn’t spend reading this.
Set up a cloud account in AWS or GCP if you haven’t already. If you’ve never done this before, buy an AWS or GCP fundamentals course from Udemy or Coursera to walk you through it.
Create buckets for data and block all public access. Explicitly define permissions so that only your users can access the data. You’ll be tempted to knock the locks off when someone has access issues. Don’t do it.
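A minimal sketch of that setup with boto3 (the bucket name is a placeholder; bucket names are globally unique):

```python
import boto3

# Create the bucket and turn on all four public access blocks.
s3 = boto3.client("s3", region_name="us-east-1")
bucket = "ml-with-friends-data"  # placeholder name

s3.create_bucket(Bucket=bucket)  # outside us-east-1, add CreateBucketConfiguration
s3.put_public_access_block(
    Bucket=bucket,
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)
```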
Crawl:
Store all of your raw data in a bucket that’s read-only for most users. Only grant write permissions to the resource or user that adds new raw data as part of a process. Organize common files in their own directories (folders) and name them consistently, ideally with timestamps.
Create an analytics bucket for processed and transformed data. Data here can be deleted or overwritten but try to keep useful things stable.
Establish local access to S3 using a credentials file. Don’t put your access key and secret key in code, even if you plan to remove it before sharing. To access data using credentials, you’ll often use one of the first two options described here. This is also a great resource for setting up your credentials file and interacting with S3 using plain old python (boto3).
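Once the credentials file exists, boto3 picks it up automatically and no keys ever appear in code. A small sketch with hypothetical bucket and file names:

```python
import boto3
import pandas as pd

# No keys in code: boto3 finds ~/.aws/credentials (written by
# `aws configure`) on its own. Bucket and key names are placeholders.
s3 = boto3.client("s3")

s3.download_file(
    Bucket="ml-with-friends-data",      # placeholder bucket
    Key="games/2023-04-01_games.csv",   # placeholder key with timestamp
    Filename="/tmp/games.csv",
)
print(pd.read_csv("/tmp/games.csv").head())
```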
Walk:
Be sure your data files are organized according to common schemas, i.e., each directory contains only files that have the same schema (columns).
Set up a serverless database like Athena (AWS) or BigQuery (GCP) to query data directly from storage with SQL. This removes bottlenecks and you only pay for what you query unlike a traditional database that runs constantly.
Query Athena in python using boto3 to load only the data you need into memory (see the sketch after this list).
Explore Glue (AWS) or Dataflow (GCP) to automate data transformations.
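Here’s the Athena-from-python sketch referenced above. The database, table, and buckets are hypothetical placeholders, and reading the result CSV straight from S3 with pandas assumes the s3fs package is installed.

```python
import time

import boto3
import pandas as pd

# Run SQL against files in S3 through Athena, then load the result into
# pandas. Database, table, and bucket names below are placeholders.
athena = boto3.client("athena", region_name="us-east-1")

query = athena.start_query_execution(
    QueryString="SELECT team, AVG(points) AS avg_points FROM games GROUP BY team",
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
qid = query["QueryExecutionId"]

# Poll until the query finishes.
while True:
    status = athena.get_query_execution(QueryExecutionId=qid)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

# Athena writes results as a CSV named after the query id (needs s3fs).
df = pd.read_csv(f"s3://my-athena-results/{qid}.csv")
print(df.head())
```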
Run:
Use Snowflake for your serverless database and dbt for transformation pipelines. It will cost slightly more than the AWS or GCP native options, but it’s worth it.
Read the full section if you want to know why; this is my strongest recommendation if you want to do this at scale. Here’s the main reference I use for structuring dbt projects.
Part 4
Machine Learning with Friends, Part 4: Building a codebase for your ML side hustle.
Machine learning is complicated. It’s even more complicated when you do it as a team. To keep your friends as friends, build a solid codebase, keep it clean, and use the same tools. This won’t be any fun if you can’t do cool stuff together.
Use Github, Gitlab, or Bitbucket for your repos. Doesn’t matter which one but pick one and start writing code collaboratively.
If you aren’t confident with git, practice with the command line before relying on a code editor to do it for you. Oh Shit, Git!?! and this setup tutorial for beginners help.
Don’t use gitflow, just squash and merge.
Use python and SQL for basically everything.
Python is a great general purpose programming language and the dominant language used in machine learning. If you want to use something else, do it at your own peril.
Do as much as possible with SQL. It’s optimized for data transformation and aggregation. This will also keep your datasets organized and available for everyone.
Use virtual environments or containers to manage dependencies.
Anaconda isn’t the worst starting point for getting comfortable with python, but switch to plain old base-python virtual environments as soon as you have your sea legs. They’re dead simple and easy to share in the repo.
Use docker containers when your environments get big and they're a pain to rebuild over and over.
Build a base with pandas, numpy, scikit-learn, and your favorite visualization library (see the sketch at the end of this list). Add a deep learning framework if necessary.
Pandas allows you to easily work with data in dataframes. Numpy provides numerical operations like random number generation and sampling.
Data visualization is important but what you use is not. Matplotlib is standard issue but I prefer plotly.
If you want to work with images or natural language, invest some time in a deep learning framework like TensorFlow or PyTorch. These can be difficult to learn but worth the effort if you need deep learning to tackle your problem.
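And here’s the whole base stack exercised in one short sketch on synthetic data: numpy for sampling, pandas for the dataframe, scikit-learn for a model, and plotly for the chart.

```python
import numpy as np
import pandas as pd
import plotly.express as px
from sklearn.linear_model import LinearRegression

# Synthetic data, just to exercise every piece of the base stack at once.
rng = np.random.default_rng(3)
df = pd.DataFrame({"x": rng.uniform(0, 10, 200)})
df["y"] = 2 * df["x"] + rng.normal(0, 2, 200)

# Fit a line and add predictions back to the dataframe.
model = LinearRegression().fit(df[["x"]], df["y"])
df["y_hat"] = model.predict(df[["x"]])

# Scatter the raw points and overlay the fitted line.
fig = px.scatter(df, x="x", y="y", title="Base stack smoke test")
fig.add_scatter(x=df["x"], y=df["y_hat"], mode="lines", name="fit")
fig.show()
```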