Building Senticheck: Collecting Data, Cleaning Text, and Seeing Progress
Some pipeline progress, Copilot coming through, and a cleaner portfolio layout.
I’m back with my second journal entry on the Senticheck project. Last time, I outlined the structure I was aiming for, and shared why I chose this project in the first place and how I planned to build it. Since then, I’ve moved from sketches to scripts, and I finally have something running (at least in early form).
In this post, I’ll go over how I started pulling in data, what I’ve built to clean and store it, and how Copilot in VS Code (using Claude Sonnet 4 as the model) sped up my development. I’ll also share a bit about my portfolio website, which now has a homepage (after spending some time tweaking CSS).
Connecting to Bluesky
To start, I needed real posts to work with. I used the Bluesky Python SDK to connect to the AT Protocol’s search API and retrieve posts based on a few keywords. The authentication part went smoothly (there's no need for an API key, just an existing Bluesky handle and an app password). Once I had that set up, I wrote a script to collect post text, timestamps, author handles, and other metadata.
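For reference, the fetch step looks roughly like the sketch below. I’m using the atproto package, and the handle, app password, and keyword are placeholders; the exact method and parameter names are from memory, so treat this as an outline rather than copy-paste code.

```python
from atproto import Client

# Log in with a Bluesky handle and an app password (no separate API key needed).
client = Client()
client.login("your-handle.bsky.social", "your-app-password")

# Search recent posts matching a keyword; the response carries a list of post views.
response = client.app.bsky.feed.search_posts(params={"q": "python", "limit": 25})
for post in response.posts:
    print(post.author.handle, post.record.text[:80])
```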
With that in place, I ran a few tests to see what kind of data I was getting, adding some basic print statements to log progress and get a feel for the structure of what the API returns. One thing that caused a bit of confusion at the beginning was how the output is organised: some fields, like the post text and timestamp, are stored inside a "record" object, while others, like the author's handle, sit outside of it. It is a small detail, but it took a moment to figure out. For now, I am working with a small sample of raw posts that I can reuse while building and testing the rest of the pipeline.
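To make the record-versus-top-level split concrete, this is roughly how I flatten one search result into the fields I keep. The attribute names follow the atproto response objects as I remember them, so consider it a sketch:

```python
def to_row(post) -> dict:
    """Flatten one search result into the fields worth storing."""
    return {
        "uri": post.uri,                      # unique ID, lives at the top level
        "author": post.author.handle,         # author info sits outside the record
        "text": post.record.text,             # text and timestamp live inside the record
        "created_at": post.record.created_at,
        "langs": post.record.langs,           # language tags attached to the post
    }
```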
Cleaning the Text
Once I had the raw posts, I built a module to clean them up (here Copilot started shining, but more on this later). The idea was to strip away anything that could get in the way of the analysis, like links, emojis, extra spaces, and special characters. I started with a few simple regex patterns, then added more rules as I ran into new edge cases (and I’ll probably have to add more in the future).
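The real module has grown a few more rules since, but the core is just a handful of compiled patterns applied in sequence. Here is a simplified sketch (not the exact code, and the emoji ranges are definitely incomplete):

```python
import re

URL_RE = re.compile(r"https?://\S+")
# Rough emoji and symbol ranges; far from exhaustive, but enough for a first pass.
EMOJI_RE = re.compile("[\U0001F1E6-\U0001F1FF\U0001F300-\U0001FAFF\u2600-\u27BF]")
SPECIAL_RE = re.compile(r"[^\w\s.,!?'-]")   # keep words and basic punctuation
SPACES_RE = re.compile(r"\s+")

def clean_text(text: str) -> str:
    """Strip links, emojis, stray symbols, and extra whitespace from a post."""
    text = URL_RE.sub(" ", text)
    text = EMOJI_RE.sub(" ", text)
    text = SPECIAL_RE.sub(" ", text)
    return SPACES_RE.sub(" ", text).strip()
```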
In addition to the usual content that needs cleaning, I ran into a few surprises while exploring the real data. One detail I didn’t expect was how some posts are tagged in a different language than they are actually written in (for example, a post written in Japanese but marked as English).
Yes, this still happens despite the API having a language filter 😂, which I assumed would be more accurate. It caught me off guard, especially since I was still getting familiar with how the data looked and this surprise showed up right away. I was not planning to do any language detection yet, but I’ll probably need to think about it later, or use a multilingual model.
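If I do end up adding that check, something small like the sketch below would probably be enough for a first pass. langdetect here is just one candidate library, not something that is wired into the pipeline yet:

```python
from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

def looks_english(text: str) -> bool:
    """Best-effort check that a post really is in English, regardless of its language tag."""
    try:
        return detect(text) == "en"
    except LangDetectException:
        # Very short or symbol-only posts can't be classified reliably.
        return False
```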
The cleaning function slowly improved as I tested it on more samples. I kept adjusting it until it handled most of what the API was throwing at me. It is not perfect, but it is good enough for now. I made it a separate module so I can reuse it later or swap it out for an existing text cleaning library (I haven’t looked into this yet, but I bet there’s more than one out there) if I decide to go that route. And since I’m storing the original version of each post, I can always go back, try different cleaning approaches on the same data, and see how those changes affect the results.
Setting Up the Database
Once the posts were retrieved and cleaned, I needed a place to store everything in a way that made it easy to search, update, and scale later on. As I mentioned before, I went with PostgreSQL: I already have some experience with it, and it’s a solid, flexible choice that felt like the right tool for the job.
I am using SQLAlchemy to handle the connection and manage the database. I like how it keeps things organised and lets me define tables directly in Python. It also simplifies the usual CRUD operations and makes it easier to update the structure later if needed.
Initially, I created two tables: one for the raw posts and one for the cleaned ones. Each entry includes fields such as post text, timestamp, author, and a few other details that may be useful later. I am also storing each post’s unique ID (its URI) to keep track of the source and avoid duplicates. This makes it easier to link records across tables and handle updates if I decide to reprocess any posts later.
I also created a table to store the sentiment scores, along with model information like the model name and version. To keep track of how each post moves through the pipeline, I added a "Processing Log" table that records steps such as fetching, cleaning, and analysis. I am thinking about adding one more table to record the keywords used when pulling posts. It would work kind of like a search history. I will leave that for later, once the rest of the flow feels more stable.
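To give an idea of the shape, here is a trimmed-down sketch of the main models. The real ones have a few more columns, and the names here are approximate rather than the exact ones in the repo:

```python
from datetime import datetime, timezone

from sqlalchemy import Column, DateTime, Float, ForeignKey, Integer, String, Text
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class RawPost(Base):
    __tablename__ = "raw_posts"

    id = Column(Integer, primary_key=True)
    uri = Column(String, unique=True, nullable=False)   # AT Protocol URI, used to avoid duplicates
    author = Column(String, nullable=False)
    text = Column(Text, nullable=False)
    created_at = Column(DateTime, nullable=False)
    fetched_at = Column(DateTime, default=lambda: datetime.now(timezone.utc))

class CleanPost(Base):
    __tablename__ = "clean_posts"

    id = Column(Integer, primary_key=True)
    raw_post_id = Column(Integer, ForeignKey("raw_posts.id"), nullable=False)
    text = Column(Text, nullable=False)
    cleaned_at = Column(DateTime, default=lambda: datetime.now(timezone.utc))

class SentimentScore(Base):
    __tablename__ = "sentiment_scores"

    id = Column(Integer, primary_key=True)
    clean_post_id = Column(Integer, ForeignKey("clean_posts.id"), nullable=False)
    label = Column(String, nullable=False)      # e.g. positive / negative / neutral
    score = Column(Float, nullable=False)
    model_name = Column(String, nullable=False)
    model_version = Column(String)
```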
Using Copilot (and Claude Sonnet 4)
One thing that has made development a lot smoother is using Copilot in VS Code. I have been using it with Claude Sonnet 4 as the model, and it has been surprisingly helpful, especially for repetitive parts like writing boilerplate code, cleaning functions, generating docstrings (my favourite use), or setting up basic class structures.
It is not perfect. Sometimes the suggestions miss what I actually need or just feel off. But when it gets it right, it saves real time.
Another feature I found useful was the built-in chat. I could ask questions or describe what I needed, and it would either apply changes directly to the files in my project or help me spot errors. This was especially handy when I needed to make several small edits across different parts of the code.
One thing to be careful about, though, is how easy it is to accept suggestions without thinking them through.
These tools can speed things up, but they can also lead you into trouble if you are not paying attention. For example, they can suggest committing sensitive keys or passwords (yes, this has happened), or generate code that ends up creating a very expensive cloud bill if deployed as-is (and yes, that has happened too). So while Copilot can be a big help, I try to keep a close eye on what it is doing and make sure I understand the changes before accepting them.
That said, I know Copilot is just one option among many. There are other tools out there like Windsurf or Cursor, which are also based on VS Code but include more AI features (or more precisely, more LLM features). I have not tried them yet, but I am planning to test them out in different projects to see how they compare.
Building the Portfolio Website
Alongside Senticheck, I’ve also been working on the portfolio site where everything will eventually live. I’m building it with Jekyll and deploying it to GitHub Pages. This setup keeps things simple and easy to maintain, without needing to worry about servers or infrastructure.
Right now, the homepage includes my name and a few social links, like my LinkedIn and email. I also added placeholders for both featured projects and blog posts, which will help bring everything together as the work progresses.
Most of the time went into getting the structure and layout right, along with tuning how things looked. But I’m happy with the result so far. It’s clean (at least to me), functional, and gives me a place to keep everything connected.
Balancing this with university and working on Senticheck has made time management really important. I have been trying to stay focused on just a few tasks at a time, break each project into small steps that move things forward, and give myself a time limit for each one.
That structure has helped me make steady progress without burning out, even when things start piling up.
What’s Next
The next big step is getting actual sentiment analysis up and running. I already have the structure in place to store results, so now it’s time to pick a model and start testing it on the cleaned posts. I mentioned before that I plan to use a model from Hugging Face, and that’s still the plan. Their model hub has plenty of pre-trained sentiment models to choose from, which should be everything I need to get things working.
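As a first test, something along these lines should be enough to get scores flowing into the database. The model name below is just a well-known example from the hub, not a final choice:

```python
from transformers import pipeline

# Load a pre-trained sentiment model from the Hugging Face hub (placeholder choice).
analyzer = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

sample = "Loving how this project is coming together."
result = analyzer(sample)[0]
print(result["label"], round(result["score"], 3))  # e.g. POSITIVE 0.999
```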
I also want to start working on the orchestration side of the pipeline. I’m planning to use Apache Airflow to manage the different steps, like fetching posts, cleaning them, and eventually running sentiment analysis. That should make the entire flow more reliable and easier to run regularly.
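I haven’t written the DAG yet, but the plan is roughly the sketch below, with each task calling one of the existing modules. Everything here (IDs, schedule, task bodies) is a placeholder:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def fetch_posts():
    ...  # call the Bluesky fetching module

def clean_posts():
    ...  # run the cleaning module on newly fetched raw posts

def analyze_posts():
    ...  # run sentiment analysis and store the scores

with DAG(
    dag_id="senticheck_pipeline",
    start_date=datetime(2025, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    fetch = PythonOperator(task_id="fetch_posts", python_callable=fetch_posts)
    clean = PythonOperator(task_id="clean_posts", python_callable=clean_posts)
    analyze = PythonOperator(task_id="analyze_posts", python_callable=analyze_posts)

    fetch >> clean >> analyze
```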
Finally, I’ll keep improving the portfolio site as I go. Right now, the layout is set and the structure is in place, but I still need to create the other pages and start adding real content (which means I also need to actually create more of that content). I want to make sure updates are easy to publish and everything stays organised as the project grows.
There’s still a long way to go, but it’s good to see things starting to come together.
Thanks for reading. If you're into data, side projects, or figuring things out one step at a time, feel free to stick around.