Skip to the end for the TLDR but I recommend taking the time with this one.
Motivation
If you read parts 1 and 2, you should be familiar with setting up an organization along with a cloud platform to supplement local work done in Unix/Linux operating systems. You might prefer AWS, GCP, or something else, but I primarily refer to AWS with some GCP terminology in this article because it’s more familiar to me. The purpose here is to offer guidance and structure to help you make the best decisions for your setup rather than give explicit instructions. Hopefully it isn’t too confusing if you aren’t familiar with AWS. In this article, I get into data storage, organization, and access. For the engineers in the audience, some of this might feel like preaching to the choir, but the focus on model building and analytics should offer a fresh perspective. For those of you who regard data the same way you do the clean water that comes out of your faucet, i.e., you don’t think about it unless something goes wrong, get ready to roll up your sleeves and do some data plumbing.
Data is one of the most important and overlooked aspects of machine learning. Too often, I see it treated like the quiet stepsister while complex algorithms try to convince everyone their big feet will fit into any glass slipper. If your data isn’t accurate, organized, and accessible, nothing else is worth doing until it is. Just a reminder: this is not a tutorial for setting up specific data pipelines and warehouses, but I do link to some great resources for these things. My goal with this article is to explain what you need, why you need it, and what to consider when it comes to data for doing machine learning with friends. You should come away from this with a clear path forward. It will require some effort and patience, but it’s worth it. After initial setup, I present three phases for a new team: crawl, walk, and run.
Setup
Data is the key ingredient for all of your machine learning projects. Whether you’re predicting floods from satellite images, who to trade in your fantasy league, or where to flip an Airbnb, the data must be organized, accurate, and accessible for your team. Much of the data generated by modern processes isn’t formatted for machine learning by default. It’s often collected, stored, and organized to support things like fulfilling orders, securing transactions, etc. As a result, we have to do preprocessing to get data into the format we need for machine learning, and in many cases, we only want subsets of larger datasets to develop algorithms. So your setup should be conducive to general data transformations and flexible subsetting. For example, you might want to combine some CSV files into a long, narrow table, pivot it into a wide table, and select only the rows from a single month. This should be easy to do without loading everything into memory.
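To make that concrete, here’s a minimal sketch in pandas. The file pattern and the date, player, and points columns are hypothetical, purely for illustration. Filtering each file as you read it keeps you from ever holding the full dataset in memory at once:

```python
# Minimal sketch: combine daily CSVs into one long table, keep a single
# month, then pivot to a wide table. File pattern and column names are
# hypothetical.
import glob
import pandas as pd

frames = []
for path in glob.glob("data/daily_stats_*.csv"):
    df = pd.read_csv(path, parse_dates=["date"])
    frames.append(df[df["date"].dt.strftime("%Y-%m") == "2022-03"])  # keep one month per file

long_table = pd.concat(frames, ignore_index=True)      # long, narrow table
wide_table = long_table.pivot_table(                   # one column per player
    index="date", columns="player", values="points"
)
```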
If you’re just getting started or trying to save money, you might be tempted to store data in Google Drive or share it between people via email. It’s not worth it. Storage in S3 (alternatively Cloud Storage in GCP) is dirt cheap, like pennies per GB per month. The limiting factor with cloud storage isn’t cost, it’s experience. If you’re totally new to this, it’s going to take some time to find your sea legs but trust me, it’s worth it.
If you didn’t read part 2, you need a cloud account. I strongly recommend following along with a structured course from Udemy or Coursera if you’ve never set up an account on your own. It will minimize the frustration and reduce the risk of doing something stupid that costs you money or makes you a target on the internet. Personally, I like the instructor Stephane Maarek on Udemy. I used a few of his courses when I was getting started with AWS and it made a huge difference. This certified developer course is more than you’ll need but it will take you through account setup and there’s a whole section on S3 you can skip to after initial setup.
An important thing to note is that you should protect your buckets by default and be explicit about permissions. Even if you aren’t working with sensitive data, I doubt you want to make it available to the entire internet by default. This will create some access problems for at least one person on your team at some point, but don’t give in to the temptation to blow off security. If your house key gets stuck in the lock, do you remove the door? Didn’t think so. Also, S3 is not a shared file system like Dropbox or Google Drive, so don’t expect it to function like one. It’s just raw storage that you can interact with programmatically or, in limited ways, through the console. This can be annoying at first when you’re used to well-designed interfaces, but machine learning is mostly done programmatically anyway.
Crawl
The foundation is storing raw data files in S3 with consistent naming and organizational structure. Given that basic setup, the crawl option is to programmatically read and write individual files as needed in the environment where you do development, e.g., a Python notebook, without a query layer. Say you scrape a website for fantasy player updates daily and produce a JSON file; that JSON should have the date in its name and should be stored in a directory (folder) with the other files like it. When you want to access all of the data from one week, you load the files from those days and do your preprocessing in the same place where you do analysis or modeling. This is the crawl option because it requires the bare minimum: reading and writing files from a bucket.
Organization
First, be sure all of your raw data is read-only so no one can accidentally delete or overwrite it. Any preprocessing or transformation that you do with the raw data should produce new files that you can change or delete as needed. Again, storage is cheap and effectively infinite here, so it’s better to create new files than to risk corrupting the source of truth. Your naming convention is up to you, but I recommend leading file names with timestamps in a standard format, e.g., “2022-03-31_13:17:51_scraped_football.json”. This makes it straightforward to write scripts that pull specific data based on key attributes encoded in the filenames.
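Here’s a rough sketch of what that looks like in practice, assuming a hypothetical raw bucket and the timestamp-first naming convention above. The leading timestamp is what makes the filtering trivial:

```python
# List the keys under a prefix, keep only the dates you want, and read
# those files one at a time. Bucket, prefix, and dates are hypothetical.
import json
import boto3

s3 = boto3.client("s3")
bucket = "my-raw-bucket"
prefix = "scraped_football/"
wanted_dates = {"2022-03-29", "2022-03-30", "2022-03-31"}

records = []
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        filename = obj["Key"].rsplit("/", 1)[-1]
        if filename[:10] in wanted_dates:  # leading timestamp makes this a simple slice
            body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
            records.append(json.loads(body))
```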
Organize storage with two top-level buckets: raw and analytics. The raw bucket should only permit specific users or resources to write; everyone else is read-only. Ideally, your data is only captured programmatically, but if you do some manual uploading or scraping, do it through a different role in your account so you don’t mix preprocessing with ingestion by accident. The analytics bucket will be looser in terms of permissions because it’s where you’ll store the data that results from preprocessing or transformations that are constantly in development. Some of you might be thinking this sounds like a data lake, and it is, in a minimalist kind of way. The main objectives are to separate raw data from processed data, and to keep things tidy through consistent indexing and cataloging.
Access
Data access in this setup is essentially reading individual files directly from S3. You can think of it like using a big shared hard drive with controlled permissions. The downside is that you have to load entire files to access any data in them, which can be an issue when you have very large files or a lot of them. If that’s not the case, then this should work fine for some time. When you want to access S3 directly from your local environment, you need to use your AWS credentials, i.e., your access key and secret key. There are a few options for this, and you’ll often see the first two described here.
I don’t recommend ever having credentials written into code, even if you remove them later. I prefer to use a credentials file. This reduces the chance that your credentials are exposed in code by accident, and it makes it easy to manage multiple AWS accounts through named profiles. This is a great resource for setting up your credentials file and interacting with S3 using plain old Python (boto3).
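As a quick sketch, assume a hypothetical profile named ml-with-friends in your ~/.aws/credentials file, which boto3 reads automatically. You reference the profile by name and no keys ever appear in your code:

```python
# The credentials file itself looks something like this (never in your repo):
#
#   [ml-with-friends]
#   aws_access_key_id = ...
#   aws_secret_access_key = ...
#
# Profile, bucket, and key names below are hypothetical.
import boto3

session = boto3.Session(profile_name="ml-with-friends")
s3 = session.client("s3")
s3.download_file("my-analytics-bucket", "exports/training_set.csv", "training_set.csv")
```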
There’s also a library, s3fs, for treating S3 like a file system. This is pretty convenient, and it’s actually wrapped into newer versions of pandas, so in a SageMaker notebook you can read and write to S3 with standard pandas functions by passing the S3 location as if it were a file path. Fair warning: I’ve run into dependency conflicts between s3fs and other libraries, so I tend not to rely on it.
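If you do want to use it, it’s about as simple as it gets. This assumes s3fs is installed alongside pandas and uses a hypothetical bucket and key:

```python
import pandas as pd

# Pandas hands the s3:// path to s3fs under the hood, so your AWS
# credentials file is still what grants access.
df = pd.read_csv("s3://my-analytics-bucket/exports/training_set.csv")
df.to_csv("s3://my-analytics-bucket/exports/training_set_copy.csv", index=False)
```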
Walk
As your team matures and you collect more data, you’ll eventually reach a point where you can’t load all of the necessary files into memory to access the data you need for a particular task. For example, say you make a daily API request to fetch raw data and save the results in a CSV that’s roughly 500MB. Now you want a handful of rows from each file spanning two years. If you try to load over 700 files that are 500MB each all at once, you’ll run out of memory. You could write a clever Python script to work around it, but that will fail too when you have a single file that’s too large to load into memory by itself. This is where a query layer lets you extract the data you need with SQL before loading it into memory in Python.
When you reach this point, I recommend using Athena in AWS or BigQuery in GCP. Both charge around $5 per TB scanned, which means you only pay for what you use, and it’s pretty cheap. Contrast this with setting up a conventional database that runs 24/7 on a large server and can cost thousands of dollars or more each month. Tools like Athena and BigQuery make working with very large amounts of data incredibly approachable.
Organization
To make use of Athena or another serverless database, you’ll want to organize your data into raw and analytics zones the way I describe above, but you also need to group files with common schemas. That is, files from a common source that have the same columns and will be viewed as a single table should live in the same directory with a consistent naming convention. This topic alone has been written about extensively, and it’s worth going down the rabbit hole if you’re interested, but for now I’ll assume you just want to be able to query large and growing datasets without running into bottlenecks.
Access
It’s a good idea to get comfortable with your query engine in the console first. You can write some simple SQL queries to get a feel for the tool, but you’ll quickly want to move to something more programmatic and streamlined. You can always use the Athena or BigQuery interface for development if you need to crack a pesky SQL puzzle, but if you just want to run a query and load the results directly into a Jupyter notebook, you can use boto3 to interact with Athena directly from Python.
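Here’s a hedged sketch of that flow, assuming a hypothetical Athena database, table, and results bucket. You start the query, poll until it finishes, then read only the results CSV that Athena writes to S3:

```python
# Run an Athena query from a notebook and pull the results into pandas.
# Database, table, and bucket names are hypothetical.
import time
import boto3
import pandas as pd

athena = boto3.client("athena")
results_bucket = "my-athena-results"

resp = athena.start_query_execution(
    QueryString="SELECT * FROM player_stats WHERE game_date >= DATE '2022-03-01'",
    QueryExecutionContext={"Database": "fantasy"},
    ResultConfiguration={"OutputLocation": f"s3://{results_bucket}/"},
)
query_id = resp["QueryExecutionId"]

while True:  # poll until the query finishes
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    # Only the query results come into memory, not the raw files (pandas + s3fs)
    df = pd.read_csv(f"s3://{results_bucket}/{query_id}.csv")
```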
This will get you pretty far when working with large datasets for your machine learning projects, but you might reach a point where you want to automate and scale some of the data transformations in your workflow. Again, this is a rabbit hole we don’t need to go down here but just in case you’re curious, the next phase would be to implement an ETL tool like Glue in AWS or Dataflow in GCP. If you’re fetching large amounts of data on a regular basis and routinely training algorithms with it, it’s a good idea to explore these options.
Run
Now I’m finally going to acknowledge the elephant in the room: some of you are doing machine learning with friends because you want to develop a kickass solution and launch a startup. For others, it might be a big open source project or something not for profit, but you still want to scale to infinity. All of these are awesome things to do and you can do them with a surprisingly lightweight data stack. It’s possible to do this entirely within a single cloud platform using services like Athena and Glue but there are other tools and I strongly recommend them if you want to scale.
I’ve used many of these tools in different contexts and I am by no means an expert in any of them. However, I am a machine learning enthusiast and I can speak to my hands-on experience with them. My best experience so far with respect to data storage, organization, and access has been with a combination of Snowflake and DBT. In this machine learning context, Snowflake is primarily a cloud-based data warehousing platform. It’s thoughtfully designed, cost-effective, and incredibly scalable. DBT, or data build tool, is an open source framework that combines SQL with software engineering best practices to make data transformations easy and reliable. DBT also pairs very nicely with Snowflake.
I was introduced to Snowflake years ago and didn’t think much of it until recently when one of the best engineers I know got me working with it and DBT in a single stack. It’s an understatement to say that it’s completely changed the way I approach data engineering and analytics. Setting everything up properly will take some time and effort but it’s worth it if you want to launch something bigger than a side hustle or hobby project.
Getting Started
Getting started with Snowflake and DBT is extremely well-documented in multiple places. It’s not worth my time or yours to try to reproduce that here so I’ll mainly point you to the good sources and add a little extra color. To start, Snowflake has a lab specifically designed for this setup that they estimate takes about an hour to complete.
I think the 57 minutes they estimate is close, but if you’re new to this kind of thing, carve out a few hours and be patient with yourself. There’s the actual setup time, but you’re also learning a new framework, and you can only do that at your own speed. DBT has a step-by-step tutorial on setting up and connecting to Snowflake as well.
I recommend starting with one of these and using the other to cross-reference if you get confused or stuck. When I’m working with something new, I like to triangulate multiple sources because nothing is perfect. Documentation gets stale, software changes, and people make mistakes so I try not to rely too much on one resource, especially when I’m a beginner.
Note: DBT also has an active Discourse community. I don’t actively contribute, but it is a useful resource.
Benefits
The first major benefit of using Snowflake and DBT for a fledgling team is cost. DBT has open source tools you can use entirely for free, as well as a free cloud account option for a single developer. Snowflake pricing can look a little ambiguous, but from personal experience, it ends up being quite inexpensive. It’s ultimately a serverless database, so you only pay for the storage and compute you use. It’s marginally more expensive than native cloud services like AWS, but the experience is worth it for me personally.
Beyond the cost benefits, Snowflake makes data incredibly accessible for people with a variety of skill sets. The user interface is clean, intuitive, and fast. You only need basic SQL knowledge to query and export data for analysis or modeling, or you can query directly from Python using the Snowflake connector. They also present decent data summaries for low-level exploration. While Snowflake doesn’t do everything, I find myself spending more time in their app to get my hands on data.
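For the Python route, a minimal sketch with the Snowflake connector looks something like this. The account, warehouse, database, and table names are hypothetical, and credentials should come from environment variables or a secrets manager rather than literals in code:

```python
# Query Snowflake and pull the results straight into a pandas DataFrame.
# Connection details are hypothetical and read from the environment.
import os
import snowflake.connector

conn = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT"],
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
    warehouse="ANALYTICS_WH",
    database="ANALYTICS",
    schema="MARTS",
)

cur = conn.cursor()
cur.execute("SELECT * FROM player_features WHERE season = 2022")
df = cur.fetch_pandas_all()  # requires the pandas extra of the connector
conn.close()
```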
Finally, DBT makes writing and deploying data transformation code easy and intuitive. It essentially comes down to writing SQL views and keeping them organized. The era of building complex ETL systems that require constant maintenance and development is nearing an end. If you can write SQL, you can automate and scale data transformations with DBT. I can’t overstate the value of this. When people can make transformed data available to the rest of the team through the data warehouse and the logic is entirely in SQL, you can grow faster than you thought possible. Here’s the main reference I use for structuring DBT projects.
TLDR
Too long; didn’t read? Bummer, this is a good one. Oh well, enjoy doom scrolling with the 15 minutes you didn’t spend reading this.
Set up a cloud account in AWS or GCP if you haven’t already. If you’ve never done this before, buy an AWS or GCP fundamentals course from Udemy or Coursera to walk you through it.
Create buckets for data and block all public access. Explicitly define permissions so that only your users can access the data. You’ll be tempted to knock the locks off when someone has access issues. Don’t do it.
Crawl:
Store all of your raw data in a bucket that’s read-only for most users. Only grant write permissions to the resource or user that adds new raw data as part of a process. Organize common files in their own directories (folders) and name them consistently, ideally with timestamps.
Create an analytics bucket for processed and transformed data. Data here can be deleted or overwritten but try to keep useful things stable.
Establish local access to S3 using a credentials file. Don’t put your access key and secret key in code, even if you plan to remove them before sharing.
Walk:
Be sure your data files are organized according to common schemas, i.e., each directory contains only files that have the same schema (columns).
Set up a serverless database like Athena (AWS) or BigQuery (GCP) to query data directly from storage with SQL. This removes bottlenecks, and you only pay for what you query, unlike a traditional database that runs constantly.
Query Athena from Python using boto3 to load only the data you need into memory.
Explore Glue (AWS) or Dataflow (GCP) to automate data transformations.
Run:
Use Snowflake for your serverless database and DBT for transformation pipelines. It will cost slightly more than the AWS or GCP native options but it’s worth it.
Read the Run section if you want to know why, but it’s my strongest recommendation if you want to do this at scale.