Starting My Data Science Portfolio (And Writing About It)
A journal about building, learning, and figuring things out.
I'm a Master's student in Data Science, and I want to do more than just finish my classes. I want to create something real, something I can learn from and experiment with. Not an assignment, not a group project, but something entirely my own. I'm in the final year of my programme (yes, still attending lectures and coursework), but I decided to start this now anyway.
Before starting this degree, I worked as a software developer. That experience helps me plan projects and build things from scratch, making the start of this portfolio feel a bit more manageable.
This won’t be a perfect tutorial or a polished showcase. I plan to write a weekly journal (we’ll see how that goes) where I share what I’m working on, the choices I make, the problems I encounter, and what I learn along the way.
In this initial post, I'm going to talk about the architecture I chose for the first project in my portfolio. I’ll explain what the project is, what options I considered, and why I made the choices I did, even if I might change my mind later.
💡 What I’m Building
For my first portfolio project, I’m building a sentiment analysis app called SentiCheck (yes, this is the best name I could think of 😂). It collects public posts from Bluesky, applies NLP models to score sentiment, and displays trends on an interactive dashboard. The idea is to help users explore how public sentiment varies over time and to show how a team could use that information to make smarter decisions.
Right now, I’m starting with a single platform (Bluesky), but I’m designing the system to be modular, so I can add other platforms later. The app will allow users to see how sentiment changes around specific topics, hashtags, or time periods. I also want to include filters and charts that make insights easier to explore.
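Since the multi-platform idea drives a lot of the design, here’s a rough sketch of the kind of connector interface I’m picturing. Everything in it (the `PlatformConnector` name, the `fetch_posts` signature) is a placeholder I fully expect to revise:

```python
from abc import ABC, abstractmethod


class PlatformConnector(ABC):
    """Shared interface that each platform connector will implement."""

    @abstractmethod
    def fetch_posts(self, query: str, limit: int = 25) -> list[dict]:
        """Return public posts matching a keyword query."""


class BlueskyConnector(PlatformConnector):
    """First (and so far only) concrete connector."""

    def fetch_posts(self, query: str, limit: int = 25) -> list[dict]:
        # The real version will call the Bluesky search API (see below).
        raise NotImplementedError
```

The rest of the pipeline only talks to the interface, so adding another platform later should mean writing one new class rather than rewiring everything.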
I chose this project because it brings together a few things I’m interested in: natural language processing, building data pipelines, and creating useful interfaces. Most importantly, it gives me the chance to practise transforming raw data into insights, which I believe is at the heart of being a good data scientist.
This isn’t meant to be a comprehensive analytics product but rather a working prototype that brings ideas together and allows for experimentation. Hopefully, by the end, it will feel like something real and useful.
🧱 Architecture Overview
Before writing any code, I wanted to outline a high-level overview of how SentiCheck would work. The system needs to do three main things: collect data from Bluesky, analyse the text, and display the results. I also want the structure to be modular and easy to automate from the start, with a clear path towards deployment once things are more stable. For now, I'm focusing on getting the core pipeline working, but I’m keeping future steps like containerisation and deployment in mind.
🛠️ Tools I'm Using and Why
Python is the language I’m most comfortable with, and every other tool in this stack has first-class Python support.
The Bluesky Python SDK simplifies API access and lets me focus on the data itself. I’m using the AT Protocol’s search API to fetch public posts by keyword.
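To give a flavour, here’s roughly what a keyword search looks like with the `atproto` package. The credentials are placeholders, and exact method and field names can shift between SDK versions, so treat this as a sketch:

```python
from atproto import Client

client = Client()
# Log in with your handle and an app password (placeholders here).
client.login("me.bsky.social", "app-password")

# Query the AT Protocol's searchPosts endpoint for public posts.
response = client.app.bsky.feed.search_posts(params={"q": "climate", "limit": 25})
for post in response.posts:
    print(post.author.handle, post.record.text[:80])
```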
Apache Airflow helps automate the pipeline. I’ve used it before and feel confident setting up DAGs for ingestion and processing tasks.
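As a rough sketch of the shape I’m aiming for, the whole pipeline could be a single TaskFlow-style DAG with three tasks. The bodies here are stubs; the real logic will live in separate modules:

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@hourly", start_date=datetime(2025, 1, 1), catchup=False)
def senticheck_pipeline():
    @task
    def ingest() -> list[dict]:
        return []  # stub: fetch posts from Bluesky and store them raw

    @task
    def clean(raw: list[dict]) -> list[dict]:
        return raw  # stub: strip URLs, emojis, and other noise

    @task
    def score(cleaned: list[dict]) -> None:
        pass  # stub: run the sentiment model and write scores back

    # Airflow infers the task dependencies from the data flow.
    score(clean(ingest()))


senticheck_pipeline()
```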
PostgreSQL is structured, reliable, and something I already have experience with.
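The schema isn’t settled yet, but as a first guess I’m picturing three tables that mirror the pipeline stages. All names here are provisional:

```python
import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS raw_posts (
    id         TEXT PRIMARY KEY,   -- post URI from Bluesky
    author     TEXT NOT NULL,
    text       TEXT NOT NULL,
    created_at TIMESTAMPTZ NOT NULL
);
CREATE TABLE IF NOT EXISTS cleaned_posts (
    post_id TEXT PRIMARY KEY REFERENCES raw_posts (id),
    text    TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS sentiment_scores (
    post_id TEXT PRIMARY KEY REFERENCES cleaned_posts (post_id),
    label   TEXT NOT NULL,          -- e.g. positive / neutral / negative
    score   REAL NOT NULL           -- model confidence
);
"""

with psycopg2.connect("dbname=senticheck") as conn:  # placeholder DSN
    with conn.cursor() as cur:
        cur.execute(DDL)
```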
Hugging Face Transformers give me access to pretrained sentiment models, which means I can get started quickly without building one from the ground up.
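For example, getting a score out of a pretrained model takes only a few lines with the `pipeline` API. The model named here is just one candidate I’ve seen recommended for short social posts, not a final choice:

```python
from transformers import pipeline

# A Twitter-trained model; short Bluesky posts should look similar enough.
sentiment = pipeline(
    "sentiment-analysis",
    model="cardiffnlp/twitter-roberta-base-sentiment-latest",
)

print(sentiment("Loving the vibe on Bluesky lately!"))
# e.g. [{'label': 'positive', 'score': 0.98}]
```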
Streamlit lets me build a working dashboard quickly without needing a full frontend.
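A first cut of the dashboard could be as small as this sketch, assuming the provisional `sentiment_scores` and `raw_posts` tables above, plus pandas and SQLAlchemy for the database read:

```python
import pandas as pd
import streamlit as st
from sqlalchemy import create_engine

engine = create_engine("postgresql:///senticheck")  # placeholder connection URL

st.title("SentiCheck: sentiment over time")

df = pd.read_sql(
    """
    SELECT r.created_at::date AS day, s.label, COUNT(*) AS posts
    FROM sentiment_scores s
    JOIN raw_posts r ON r.id = s.post_id
    GROUP BY day, s.label
    ORDER BY day
    """,
    engine,
)

label = st.selectbox("Sentiment", sorted(df["label"].unique()))
st.line_chart(df[df["label"] == label].set_index("day")["posts"])
```

Running `streamlit run dashboard.py` serves this locally, which feels like plenty for a prototype.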
Docker will come in later, once the system is stable, to help with packaging and deployment (likely to the cloud, though I haven’t chosen a platform yet).
Git is there for version control and tracking progress as the project grows. I’ll share the GitHub repo later on once I have something worth showing.
📊 System Diagram
To better show how everything connects, here’s a high-level diagram of the architecture, followed by a step-by-step breakdown of what’s happening in the image:
What’s Happening in the Diagram:
Airflow schedules and triggers each pipeline component.
Bluesky Connector collects public posts from Bluesky.
The connector stores raw posts in PostgreSQL.
Text Cleaning pulls the raw data from the database, strips noise like URLs and emojis, and stores the cleaned posts (there’s a sketch of this step after the list).
Sentiment Scoring uses a Hugging Face transformer to assign sentiment to each cleaned post.
Sentiment scores are stored back in PostgreSQL.
Streamlit Dashboard reads from PostgreSQL to display trends and results.
Docker will be used later to containerise the full system for deployment.
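Here’s the kind of thing I mean by the cleaning step: a small function that strips URLs, mentions, and emojis before the text reaches the model. The exact rules will no doubt evolve once I’ve looked at real posts:

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@[\w.]+")
# Rough emoji ranges; real data may need a more careful pattern.
EMOJI_RE = re.compile(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def clean_text(text: str) -> str:
    """Normalise a raw post into plain text for the sentiment model."""
    text = URL_RE.sub("", text)
    text = MENTION_RE.sub("", text)
    text = EMOJI_RE.sub("", text)
    return " ".join(text.split())  # collapse leftover whitespace


print(clean_text("Loving this! 🎉 https://example.com @someone.bsky.social"))
# -> "Loving this!"
```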
🧭 What’s Next
Now that I’ve sketched out the architecture, the next step is to start building the ingestion pipeline. I’ll be working on connecting to the Bluesky API using the Python SDK, collecting a sample of public posts, and exploring the structure and quality of the data. Once I have that working, I’ll move on to cleaning the text and experimenting with a sentiment model.
At the same time, I’ve started working on a personal website using Jekyll on GitHub Pages. It will be a place where I can showcase my portfolio projects and link to these blog posts, so everything lives in one space. I’m also using this as a chance to get hands-on experience with Jekyll itself. For the design, I’ve been browsing sites like Dribbble and Behance to get inspiration and figure out what feels clean and readable. It’s still early, but it’s part of the bigger picture.
This early part of the project will probably involve a lot of trial and error, which is exactly why I wanted to document the process. I’m aiming to make steady progress, write honestly about what’s working (and what isn’t), and keep things flexible as I learn.
In the next post, I’ll share how the ingestion process is going and what I’ve learned from playing with the data. Hopefully by then I’ll have something running, even if it’s a little messy.
Thanks for reading. If you're into data, projects, or figuring things out in public, feel free to stick around.