There are levels to Data portfolio projects. It’s okay to start with a beginner-level project, but you should be leveling up as you build more technical skills.
One easy way to level up your portfolio project — use a LIVE data source instead of a static data source.
What is the difference between the two? A static data source uses pre-downloaded datasets that never change (like a CSV file you found on Kaggle), while a live data source pulls fresh, real-time information that updates automatically (like current stock prices or social media feeds).
Why is this impressive to employers? Because it mirrors the exact kind of work you’d do on the job. As a Data Scientist or Analyst, you’ll almost always be working with live, messy, ever-changing data — rarely static files.
In this article, I’m going to walk through 3 options (with increasing levels of complexity) for integrating a live data source into your model.
Connect to a database
Connect to a data API
Scrape your own data from websites
Level 1: Connect to a Database
Connecting to a database is a great first step into dynamic data. Instead of working with static CSV files, you're pulling information that gets updated regularly by real applications.
To do this, you’ll need:
Basic SQL knowledge
Access to a database (some options listed below)
A database connection library (like sqlite3 or psycopg2) if you’re working in Python (see the sketch below).
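Here’s a minimal connection sketch in Python. The database file and table name (portfolio.db, sales) are placeholders for whatever source you end up using; with a Postgres database you’d swap sqlite3.connect for psycopg2.connect, but the query-to-dataframe pattern stays the same.

```python
import sqlite3

import pandas as pd

# Minimal sketch: "portfolio.db" and the "sales" table are placeholders --
# point this at whatever database you actually have access to.
conn = sqlite3.connect("portfolio.db")

# Pull the freshest rows straight into a DataFrame for analysis.
df = pd.read_sql_query(
    "SELECT * FROM sales ORDER BY sale_date DESC LIMIT 100;",
    conn,
)
conn.close()

print(df.head())
```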
I couldn’t find a lot of public SQL databases, but here are two options to get you started:
Level 2: Connect to a data API
APIs are how applications share data — think about it like a power outlet. You don’t care how electricity is generated, you just plug in and get access to the energy (data in this case 😉) you need.
Feeling intimidated? Don’t be. It’s really not that complex, but it is an important skill to have.
APIs are everywhere, and employers love to see candidates who can automate data pulls without manual downloads.
Connecting to a data API can be broken down into these 3 simple steps:
Get access through an API key
Make a request – usually an HTTP call (GET) that asks the API for specific data.
Parse the response – most APIs return JSON, which you can easily load into a dataframe for analysis (see the example below).
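Here’s what those three steps can look like in Python. The endpoint, parameter names, and JSON structure below are made up for illustration; every API documents its own, so check the docs for whichever one you pick.

```python
import requests
import pandas as pd

# Step 1: most APIs give you a key when you sign up (placeholder below).
API_KEY = "your_api_key_here"

# Step 2: make a GET request asking for specific data.
# The URL and parameters are hypothetical -- swap in your chosen API's.
response = requests.get(
    "https://api.example.com/v1/prices",
    params={"symbol": "AAPL", "apikey": API_KEY},
    timeout=10,
)
response.raise_for_status()  # fail loudly if the request didn't succeed

# Step 3: parse the JSON response into a DataFrame for analysis.
data = response.json()
df = pd.json_normalize(data["results"])  # the "results" key depends on the API
print(df.head())
```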
Where to find data APIs? There are so, so, so many options for data APIs that you can use to level up your portfolio. Find a full list of 18 sources on my data-portfolio-handbook GitHub repo — Data APIs section.
Level 3: Scrape your own data
Now this is where things get really complex, but it’s also where you can show off advanced data skills.
This is the fun stuff. Once you learn to scrape data, you (potentially) have unlimited access to the internet. That means datasets that no one else has in their portfolio. A truly unique project that is yours and only yours.
The two Python libraries I’ve used most for web scraping are BeautifulSoup and Selenium.
How do you actually scrape data? (Or watch this video to get started.)
Inspect the website – open your browser’s developer tools, look at the page structure (HTML tags, classes, IDs).
Send a request – use Python’s requests library to fetch the raw HTML from a URL.
Parse the page – with BeautifulSoup, you can extract specific elements (like all the titles, prices, or links).
[Optional] Handle interactivity – if the site uses lots of JavaScript, you’ll need Selenium to simulate clicks, scrolls, or logins.
Store the data – save it into a database or CSV so you can analyze it later (a minimal end-to-end sketch follows below).
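To make those steps concrete, here’s a minimal requests + BeautifulSoup sketch. The URL and the tag/class names are placeholders; you’d find the real ones by inspecting the site in your browser’s developer tools. And if the page is rendered with JavaScript, you’d fetch it with Selenium instead (step 4) before parsing.

```python
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Placeholder URL and selectors -- inspect the real site to find the right ones.
url = "https://example.com/books"
headers = {"User-Agent": "Mozilla/5.0 (portfolio-project demo)"}

# Step 2: fetch the raw HTML.
html = requests.get(url, headers=headers, timeout=10).text

# Step 3: parse the page and extract the elements you care about.
soup = BeautifulSoup(html, "html.parser")
rows = []
for card in soup.find_all("article", class_="product"):
    rows.append({
        "title": card.find("h3").get_text(strip=True),
        "price": card.find("p", class_="price").get_text(strip=True),
    })

# Step 5: store the data so you can analyze it later.
pd.DataFrame(rows).to_csv("scraped_books.csv", index=False)
```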
But… there are risks and gotchas with web scraping. Make sure you consider these before you start scraping (don’t say I didn’t warn you).
Legal/Ethical boundaries: always check a site’s Terms of Service; some explicitly ban scraping.
IP blocking: you might get rate-limited or banned if you scrape too aggressively. I personally never log in to any accounts when I’m scraping.
Performance: scraping at scale requires handling retries, delays, and parallelization. This is the super advanced stuff; don’t worry about it until it becomes a bottleneck (the snippet below shows the basic idea).
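If you do hit that point, the basic idea is simple: pause between requests, and retry with a growing delay when one fails. A rough sketch (placeholder URLs):

```python
import time

import requests

# Hypothetical list of pages to scrape politely.
urls = [f"https://example.com/page/{i}" for i in range(1, 6)]

for url in urls:
    for attempt in range(3):  # retry each page up to 3 times
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            break  # success -- stop retrying
        except requests.RequestException:
            time.sleep(2 ** attempt)  # back off: wait 1s, 2s, then 4s
    time.sleep(1)  # pause between pages so you don't hammer the server
```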
———
Want to learn more about scraping data? You’re in luck!
The #1 web scraping event takes place online on October 1st. It’s OxyCon 2025, hosted by Oxylabs!!
Completely free to attend, so you have no excuse 😉
They have lots of interesting talks and panels lined up, and I’m personally most looking forward to these 4 topics:
1️⃣ Live demonstration on integrating the extracted web data into a personal project with the help of an AI prompt.
2️⃣ How to structure messy scraped data efficiently with an API (live demo).
3️⃣ AI-Scraper loop: How scraped data improves AI models and how AI enhances scraping accuracy.
4️⃣ How Cursor and MCPs work together in a real-world scraping setup.
See the full agenda and save your spot here: https://oxylabs.go2cloud.org/aff_c?offer_id=7&aff_id=1644&url_id=175