How to build more complex Python projects, progressively
Every Data professional needs to read this 👀
If I were to start skiing, I wouldn’t go straight down a triple black diamond. Instead, I’d start on the bunny slope and slowly work my way up to the black diamond runs.
Similarly, if you’re in the process of learning or building Python projects, don’t start with fine-tuning an LLM — it’s crazy hard, and it’s demoralizing if you can’t figure it out.
So in this article, I’ll walk through the different levels of Python projects. My recommendation is to figure out which level you’re at, start there, and gradually level up.
If you’re a Data Scientist or a Data Analyst, this one is for you!
Level 0: Data cleaning project
This is a crucial step of any project, because it ensures the integrity of any downstream analyses or models you build.
And so, your first Python project will be a data cleaning project. In this project, you’ll likely only need two libraries: pandas and numpy.
Every data cleaning project has 3 components:
Exploring the raw data
Cleaning the data
Validating the data was cleaned correctly
By the way, expect that data cleaning will take a long time. A project like this can take 5-10 hours, especially if it’s the first project you’re building.
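Here’s a minimal sketch of those three components with pandas. The dataset, column names, and values are all made up for illustration — in a real project you’d load your own file with pd.read_csv:

```python
import pandas as pd

# Hypothetical raw data; in practice you'd load this with pd.read_csv
df = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "order_date": ["2024-01-05", "2024-01-06", "2024-01-06", "not a date"],
    "region": [" west ", "EAST", "EAST", "west"],
    "revenue": ["100.0", "250.5", "250.5", None],
})

# 1. Explore the raw data: dtypes and missing values
df.info()
print(df.isna().sum())

# 2. Clean: fix types, drop duplicates, normalize text, fill gaps
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
df["revenue"] = pd.to_numeric(df["revenue"], errors="coerce")
df = df.drop_duplicates(subset="order_id")
df["region"] = df["region"].str.strip().str.title()
df["revenue"] = df["revenue"].fillna(df["revenue"].median())

# 3. Validate: assertions fail loudly if the cleaning missed something
assert df["order_id"].is_unique
assert df["revenue"].notna().all()
assert (df["revenue"] >= 0).all()
print(df)
```

Those assertions at the end are the validation step: cheap to write, and they stop bad data from quietly flowing into your analysis.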
Level 1: Exploratory data analysis (EDA)
Feel free to keep building on the previous project, because EDAs are a logical next step to data cleaning.
Not sure what an EDA is? It boils down to one thing: What is the data telling you?
You want to look for trends, outliers and surprises — this is usually where the “story” is. Let me share a few examples to illustrate my point:
“Sales of seasonal products like winter jackets spike sharply in November and December, but online sales are growing faster year over year than in-store sales.” — example of a trend.
“Most customers who churned had below-average engagement. Except for one segment that was highly active but still churned immediately after a pricing change.” — example of an outlier.
“We assumed that customers who clicked on more marketing emails would be more likely to buy, but the data showed the opposite: heavy email clickers actually converted less, suggesting fatigue.” — example of a surprise: something that goes against business intuition.
A key part of EDA is the storytelling component. Because you don’t just want to get insights, you want to convince your business partners they need to pay attention to those insights.
This would be a good point to bring in one of the charting libraries — seaborn or matplotlib. My preference is seaborn, because it has more out-of-the-box chart types, though fewer customization options. Generally, I find seaborn gives me enough control.
Level 2: Automate a process
Automation gives me the chills (of excitement), I think because automation is such an easy way to demonstrate impact. This is exactly what I did in my job at Amazon, and it helped me earn a promotion. If you want to read more, I wrote about it in this article.
Identify a repetitive process → automate it → save cost and time for your company.
Identify a candidate for automation: Look for something you or your team does often and manually. It might be cleaning weekly reports, combining CSV files, or formatting dashboards. The key is that it’s predictable, takes up time and requires little or no human intervention.
Automate it: Once you’ve identified the process, map out the steps and translate them into Python. Then schedule it to run with cron or Airflow so it happens without you. There is no greater flex (IMO) than a process that runs while everyone is on holiday, or while you’re out on PTO.
Save cost and time for your company: Want to get promoted quickly? Make sure to track and share any cost and time savings to show the business impact.
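As a sketch of that loop, here’s a hypothetical script for one of the examples above: combining weekly CSV reports into a single file. The folder name, file names, and columns are all assumptions for illustration:

```python
from pathlib import Path
import pandas as pd

def combine_weekly_reports(report_dir: str, output_path: str) -> pd.DataFrame:
    """Concatenate every CSV in report_dir into one deduplicated file."""
    frames = []
    for csv_file in sorted(Path(report_dir).glob("*.csv")):
        frame = pd.read_csv(csv_file)
        frame["source_file"] = csv_file.name  # keep an audit trail
        frames.append(frame)
    combined = pd.concat(frames, ignore_index=True).drop_duplicates()
    combined.to_csv(output_path, index=False)
    return combined

# Demo: create two fake weekly reports, then combine them
demo = Path("reports_demo")
demo.mkdir(exist_ok=True)
pd.DataFrame({"week": [1], "sales": [100]}).to_csv(demo / "week1.csv", index=False)
pd.DataFrame({"week": [2], "sales": [120]}).to_csv(demo / "week2.csv", index=False)

result = combine_weekly_reports("reports_demo", "combined.csv")
print(result)

# Schedule it so it runs without you, e.g. a weekly cron entry:
#   0 6 * * 1 python combine_reports.py
```

The `source_file` column is a small design choice worth copying: when a number looks wrong downstream, you can trace it back to the exact report it came from.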
Level 3: Core machine learning model
Okay, we’re finally here. The destination we’ve been looking forward to: the machine learning models!
But before you dive into the advanced AI stuff (I promise that’s coming soon), start with the core ML models. By core ML models, I mean a regression model, a clustering model, or a tree-based model.
There are many, many possible steps to building out a core ML model. Here are the non-negotiable steps:
Data cleaning
Feature engineering
Feature selection
Train/test split
Train model
Check performance on training and test data
Sanity check model outputs
Model interpretation and storytelling
Documentation (I don’t like doing this either, but it’s a must-do!)
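The middle steps of that list can be sketched end-to-end with scikit-learn. This uses a bundled toy dataset (so the data cleaning and feature engineering are already done for you) and a random forest as the tree-based model:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Bundled dataset stands in for your cleaned, feature-engineered data
X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Train/test split, so performance is measured on data the model never saw
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Train a tree-based model
model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

# Check performance on BOTH splits; a big gap suggests overfitting
train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, model.predict(X_test))
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")

# Model interpretation: which features drive the predictions?
importances = sorted(zip(X.columns, model.feature_importances_),
                     key=lambda pair: pair[1], reverse=True)
for name, score in importances[:5]:
    print(f"{name}: {score:.3f}")
```

The sanity-check and storytelling steps are on you: do the top features make sense to a domain expert, and can you explain them in one slide?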
Bonus points: put your model into production. Nothing tests the rigor of your model like running it in production. Plus, you can build your deployment + MLOps skills too.
Level 4: Advanced AI, like Deep learning or LLMs
This step is where you go wild. Up to this point, I think everyone should do the same types of Python projects — think of them like the core classes of your major; everyone has to take those classes.
But Level 4 is like an elective. Here is where our paths branch, and we can each go explore what feels most exciting.
Some ideas to get you started:
Fine-tuning an LLM for Q&A: Adapt a pre-trained transformer to answer domain-specific questions
Explainability Dashboard: Use SHAP or LIME to visualize why a model makes predictions, then package the explanations in a simple dashboard
Image Defect Detection: Train a deep learning model like ResNet or EfficientNet on labeled product images to detect defects such as scratches or misprints
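For the explainability idea: SHAP and LIME are separate packages you’d install for the real thing. As a dependency-light sketch of the same concept using only scikit-learn, here’s permutation importance — shuffle one feature at a time and measure how much accuracy drops:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# Shuffle each feature and measure the accuracy drop:
# a bigger drop means the model leans on that feature more
result = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=42
)
ranked = sorted(zip(X.columns, result.importances_mean),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked[:5]:
    print(f"{name}: {score:.4f}")
```

Swap in SHAP once you’re comfortable — it gives per-prediction explanations, which is what makes the dashboard idea compelling — but the “perturb and measure” intuition is the same.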
That’s it. Just 5 levels to get you from your first Python project to your first Advanced AI project. Happy building!
Web scraping doesn’t have to be painful. Oxylabs’ Web Scraper API makes web scraping easy and fast.
This API works like a web scraper, but it’s remote. You send a simple API request with the URL. Then the API collects the data and returns the results cleanly structured.
My favorite features of using Oxylabs’ Web Scraper API:
• Handles IP blocks, CAPTCHAs, and site protections automatically
• Scales to thousands of requests with clean, structured output
• You pay only for successful results
Try Oxylabs’ Web Scraper API free for up to 2,000 results → https://oxylabs.go2cloud.org/aff_c?offer_id=7&aff_id=1644&url_id=174