Designing and Building a Personal Spotify Data Lakehouse
The code related to this article can be found in the following repository. Feel free to reach out for any questions or suggestions.
I. Introduction
II. The Data Challenge
III. Architecture and Implementation
IV. Pipeline Design
V. What's Next
I. Introduction
The Problem With Spotify Wrapped
Every December, the internet loses its mind over Spotify Wrapped: people share their top artists, their listening minutes, their "audio aura". It trends on every platform, and the media covers it like a real cultural event. And yet, every year, Wrapped manages to disappoint in more ways than one can imagine. The stats are surface level and the methodology is opaque: it tells you your top 5 artists but not how your taste evolved over the year, your total minutes listened but not when you listened or how the mood shifted across seasons. It says nothing about your skip rate, whether you actually finish the songs you start, or how your weekend listening differs from weekdays. These are not hard questions; the data exists, Spotify just chooses not to answer them, or answers them poorly.
For a company sitting on one of the richest behavioral datasets in the world, Wrapped feels like it was put together by a sleep-deprived team during a hackathon: high on hype but lacking depth. What's more baffling is that, despite having all that valuable data, they chose to invent and overhype the "Audio Aura" in collaboration with an "aura reader".
So I decided to take matters into my own hands and build my own analytics from my listening history. But before diving in, we first need a solid foundation: one that holds the data needed to answer our questions and evolves continuously with every listening event.
Starting Without Requirements
In any data project, the first thing we are supposed to do is define our requirements: What questions are you trying to answer? What metrics matter? What does the final dashboard look like? We skipped that step entirely because, to be completely honest, sitting at my desk trying to write requirements down would have put this project in the "not fun anymore" territory. Instead, we went in with a simple (and arguably naive) goal: get as much data as possible, about everything, and figure out what to do with it later. We wanted a robust data lakehouse that updates continuously, because I listen to Spotify all day long and wanted the data to reflect that in near real-time. Besides, if I had to list all the things I would want to do with my Spotify listening data, the list would be endless.
This made the project significantly harder once started: when we do not know what questions we want to answer, every design decision becomes a tradeoff with no clear winner. Should we normalize aggressively or keep things denormalized? How much enrichment is enough? What grain should the fact table be at? Is adding weather data worth the hassle? It is hard to score when you do not know where the target is. But it also made the project more interesting and more fun: by not constraining ourselves to a predefined set of metrics, we ended up building something flexible enough to answer questions we had not thought of yet.
II. The Data Challenge
What Spotify Gives Us (Spoiler: Not Much)
The Spotify API has a single relevant endpoint for listening history, /me/player/recently-played, which returns the last 50 tracks we listened to, and that is it. No pagination beyond that, no historical data, just the last 50. If we want anything older, we have to request a data export from the Spotify account privacy settings. This takes days to arrive and comes in two flavors: a standard export (basic, no track IDs, local timezone) and an extended export (detailed, with track IDs, UTC timestamps, skip/shuffle/offline metadata). The standard export gives us endTime, artistName, trackName, and msPlayed, while the extended one adds platform, country, shuffle, skipped, offline, incognito mode, and crucially, a Spotify URI we can use to link back to the catalog.
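As a sketch, one item from the recently-played response can be flattened into a plain record. The nested shape (item → track → artists/album) follows Spotify's documented response; the helper name and the choice of columns are this sketch's own:

```python
def flatten_play(item: dict) -> dict:
    """Flatten one recently-played item into a flat record.

    Only the first listed artist is kept here; that is a
    simplification, not something the API mandates.
    """
    track = item["track"]
    return {
        "played_at": item["played_at"],  # ISO-8601 timestamp
        "track_id": track["id"],
        "track_name": track["name"],
        "artist_name": track["artists"][0]["name"],
        "album_name": track["album"]["name"],
        "duration_ms": track["duration_ms"],
    }
```

In the Bronze layer the raw JSON is stored untouched; a flattening step like this only happens later, at transform time.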
Even with the extended export, we are still missing a lot: no genre information, no audio features (danceability, energy, valence), no album metadata beyond the name, no artist popularity or follower counts. In fact, for the past few years, Spotify has been gradually phasing out parts of its API and shows no signs of slowing down or restoring the old features. While this is understandable, it is not necessarily justifiable.
For a project covering nearly a decade of listening history, the Spotify API's rate limits are extremely restrictive: we can make roughly one request per second before being throttled, and the search endpoint, which we rely on to resolve track IDs for standard export records, is throttled even more aggressively. To avoid being blocked entirely, we had to insert 35-second pauses after each request. Even so, after a couple of hours we still get blocked, requiring a 24-hour wait to resume. Enriching thousands of tracks this way takes days, making it highly impractical and very frustrating.
Turning to Alternatives
Since Spotify was not going to give us genre and popularity data at scale, we needed an alternative. The two serious options were MusicBrainz and Last.fm. The former is the open-source music encyclopedia: comprehensive, community maintained, and well structured. But its genre/tag system is sparse for many tracks, and its API requires matching by ISRC or fuzzy name matching, which introduces complexity and error rates.
Last.fm won for several reasons. First, its track.getInfo endpoint returns community-curated tags, global listener counts, and global play counts, all in a single call. Second, its artist-level tags are better populated than track-level ones, offering us a reliable fallback. Third, the API is generous: even with minimal sleep between requests (we use 1 second, even though 0.5 seconds works), it handles sustained batch loads without issues. The main quirk is that its JSON responses are inconsistent (tags can be a single object or a list), but that is easy to handle in code and is no big deal.
We use Last.fm for two enrichment passes: one at the track level (tags, listeners, playcount) and one at the artist level (genre tags). A noise filter strips out useless tags like "seen live", "favourite", and "spotify" that pollute the tag space.
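The quirk and the noise filter together amount to a few lines. A minimal sketch, assuming the track.getInfo payload nests tags under toptags.tag (the noise list here is an illustrative subset of ours):

```python
NOISE_TAGS = {"seen live", "favourite", "favourites", "spotify"}  # illustrative subset

def extract_tags(track_payload: dict) -> list[str]:
    """Normalize Last.fm tags: a single tag arrives as a bare dict,
    multiple tags as a list of dicts."""
    raw = track_payload.get("toptags", {}).get("tag", [])
    if isinstance(raw, dict):  # single-tag responses are not wrapped in a list
        raw = [raw]
    tags = [t["name"].strip().lower() for t in raw if t.get("name")]
    return [t for t in tags if t not in NOISE_TAGS]
```

The same normalization is reused for the artist-level pass, since artist.getInfo exhibits the same single-object-vs-list behavior.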
Enter the Dumps
While researching data sources, we discovered that someone had scraped and published large scale Spotify catalog dumps online: tracks, albums, artists, audio features, genre mappings, the works. Hundreds of gigabytes of structured data covering most of the Spotify catalog. Is this pushing into a gray area in terms of compliance? Sure. But Spotify has been systematically restricting API access, killing useful endpoints (audio features, related artists), tightening rate limits, and generally making it harder for developers and researchers to work with their platform. They removed the audio features API in late 2024, which was one of the most useful endpoints for music analysis. They did this while sitting on all this data themselves and only surfacing it through a flashy but shallow year end marketing campaign.
So we used the dumps with audio features (danceability, energy, valence, tempo, acousticness, and more), artist genres from Spotify's own taxonomy, album labels, UPCs, and popularity scores. The dump enrichment pipeline runs in five phases: track ID resolution (matching export records that have no Spotify URI), track detail fill, artist enrichment, album enrichment, and audio features. It runs locally because the dumps are around 100GB and uploading them to our data lake was not worth it.
III. Architecture and Implementation
The Medallion Architecture
The medallion architecture (Bronze, Silver, Gold) is a natural fit for this project because the data arrives in wildly different formats from different sources at different times, and needs progressive refinement:
Bronze is the raw layer, where JSON files land exactly as they came from the source: API responses, export files, enrichment results. No transformations, no cleaning; every file preserves the original schema. This matters because Spotify's API response schema has changed over time, export formats differ between standard and extended, and enrichment results have their own structure. Trying to normalize all of this on ingest would be fragile and lossy.
Silver is where the chaos becomes structure through a single transform pipeline that reads every Bronze source, handles schema differences, deduplicates across sources, and outputs a clean star schema in Parquet files. The deduplication is priority based: if the same play event appears in the API data, the extended export, and the standard export, the API version wins because it has the richest metadata. The star schema has dimension tables for artists, albums, tracks, Last.fm track tags, Last.fm artist tags, and audio features, plus a fact table of listening events partitioned by year and month.
Gold is the final analytics layer. It joins dimensions into wide tables (dim_tracks with album + audio features + tags, dim_artists with genres + tags), builds a wide fact table (fact_plays with 52 columns including derived fields like mood quadrant, completion percentage, time dimensions), and pre-computes aggregations at daily, monthly, per artist, and per track granularity.
The alternative would have been to dump everything into a single wide table from the start. But with data arriving from 5+ sources asynchronously, with enrichment passes that run days apart, and with a need to re-transform without re-ingesting, the layered approach keeps things manageable. Bronze is append only and immutable, while Silver can be fully rebuilt from Bronze at any time. Gold can be fully rebuilt from Silver, so each layer has a clear contract.
The Star Schema
The Silver layer uses a textbook star schema because the data naturally decomposes into entities (artists, albums, tracks) and events (listening). The fact table (listening_events) is at the grain of one play event and carries foreign keys to each dimension table.
This design pays off in the Gold layer, where different analyses need different joins: daily aggregations only need the fact table and artist tags, track-level stats need the tracks dimension, and the wide fact_plays table joins everything together for ad-hoc queries, while the pre-computed aggregations avoid repeating expensive joins.
One interesting design choice: the fact table joins to artists on artist_name rather than artist_id, because standard export records only have names, no IDs. This is a pragmatic compromise, as name-based joins are imprecise (artist name collisions could exist), but the alternative would be losing the ability to enrich a large portion of the historical data.
Infrastructure: Minimal and Serverless
The infrastructure is deliberately minimal, fully defined in Terraform with four modules:
Azure Data Lake Storage Gen2 with hierarchical namespace enabled. This gives directory level operations and Hive-style partitioning support while remaining compatible with the standard Blob API. Standard LRS keeps costs low since this is not mission critical data.
Azure Container Registry (Basic SKU) to host the Docker image with no admin credentials, just RBAC-based pulls via managed identity.
Azure Key Vault for Spotify API credentials (client ID, client secret, refresh token). These are added manually after infrastructure provisioning to keep them out of Terraform state entirely, and the workload only gets read access (Get, List).
Azure Container Apps Jobs for compute. This was the key infrastructure decision, as Container Apps Jobs are purpose-built for scheduled batch workloads: you define a cron schedule, a Docker image, and resource limits, and Azure runs it with zero idle cost. Compared to Azure Functions (language and runtime constraints, cold starts, execution time limits) or AKS (massive operational overhead for a simple cron job), Container Apps Jobs hit the sweet spot. Two jobs run hourly: ingest at :05 (1 CPU, 2 Gi, 30-minute timeout) and transform-gold at :20 (2 CPU, 4 Gi, 90-minute timeout).
A User-Assigned Managed Identity ties everything together: both jobs share the same identity, which has Storage Blob Data Contributor access (scoped to the storage account) and AcrPull access (scoped to the registry). No credentials in environment variables, no secrets to rotate.
The entire infrastructure costs next to nothing because everything runs on consumption-based pricing and storage costs are minimal for Parquet files. Container Apps Jobs only bill for the seconds they actually run, while the registry is on the cheapest tier available.
Polars
The processing stack is Python with Polars, and, in full transparency, Polars was not the first choice: we started working on the project while waiting for Spotify to send us the exports. The initial prototype used Pandas and worked for small datasets, but fell apart when processing the full export history (hundreds of thousands of events across tens of JSON files). Pandas' eager evaluation model meant loading everything into memory at once, and its single-threaded execution made transforms painfully slow.
DuckDB was the next candidate, as it is fast, handles Parquet natively, and has excellent SQL support. But it introduced friction in the Azure integration layer: reading from and writing to Azure Blob Storage with Hive partitioning required workarounds, and the fsspec integration was not as smooth as we needed. DuckDB is great for local analytics, but as the core engine of a pipeline that reads from and writes to cloud storage on every run, the impedance mismatch was noticeable (side note: this may be a skill issue, because the errors we kept receiving were so annoying that we just gave up after an hour; also, we tried this on Azure Functions, not in a container).
Polars turned out to be the right tool: it is very fast since it is Rust-based, its lazy evaluation lets it process datasets larger than memory, it reads Parquet with predicate pushdown so partitioned scans only touch the files they need, and, crucially, it has first-class fsspec/adlfs integration for Azure Blob Storage, meaning we can call pl.scan_parquet("abfs://container/path/**/*.parquet") and it just works with the right storage options.
The pipeline reads hundreds of JSON files in parallel (ThreadPoolExecutor with 64 workers for Bronze downloads), transforms them through Polars expressions, and writes partitioned Parquet back to Azure. The Silver transform, which processes the entire listening history from scratch, completes in under two minutes and Polars made that possible.
IV. Pipeline Design
The system has seven pipelines (The Magnificent Seven), each with a single responsibility:
Ingestion: fetches recent tracks from the Spotify API using cursor-based pagination. The cursor (a Unix millisecond timestamp) is persisted to blob storage so that each run picks up exactly where the last one left off.
Backfill: uploads historical Spotify export files, validates records against Pydantic schemas, filters out podcasts, and stores the raw bytes to preserve the original data.
Silver Transform: the core pipeline. It reads all Bronze sources, handles format differences, deduplicates, and outputs the star schema. This is the most complex piece because it has to reconcile data from five different sources with different schemas.
Enrichment: resolves missing track IDs by searching the Spotify API. This is painfully slow (35-second sleeps between batches) and runs locally only.
Last.fm Enrichment: fetches genre tags and popularity data from the Last.fm API in two passes (tracks, then artists).
Dump Enrichment: runs five phases against local catalog dumps to fill in audio features, genres, labels, and other metadata.
Gold: builds the analytics layer with wide dimensions, the wide fact table, and four aggregation tables.
For scheduled execution in Azure, two composite commands exist: ingest (runs alone at :05) and transform-gold (runs transform then gold at :20) with a 15-minute gap to give ingestion time to complete before the transform reads the new data.
Key Design Decisions
Pydantic for validation at the boundary: all external data (API responses, export records, Last.fm responses) passes through Pydantic models before entering Bronze, so schema changes are caught early. Pydantic's AliasChoices feature handles the standard vs. extended export format differences cleanly: one model, two field name mappings.
Source-priority deduplication: the same play event can appear in the API data, the extended export, and the standard export. Rather than complex merge logic, each source gets a priority number (API=1, extended=2, standard=3), and when duplicates are detected (same track at the same timestamp), only the highest-priority version survives. Simple, deterministic, and easy to debug.
Base64-encoded filenames for enrichment: each enrichment result is stored as a single JSON file in Bronze, keyed by track ID or artist ID. Since these IDs can contain characters unsafe for blob storage paths, they are base64 encoded. This also makes idempotency trivial: if the file exists, the enrichment was already attempted, skip it.
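The encoding itself is two lines in each direction. A sketch (the path prefix is illustrative):

```python
import base64

def blob_name(entity_id: str, prefix: str = "bronze/enrichment") -> str:
    """Map an arbitrary track/artist ID to a safe, reversible blob name."""
    encoded = base64.urlsafe_b64encode(entity_id.encode()).decode()
    return f"{prefix}/{encoded}.json"

def entity_id_from(blob: str) -> str:
    """Recover the original ID from a blob name produced above."""
    encoded = blob.rsplit("/", 1)[-1].removesuffix(".json")
    return base64.urlsafe_b64decode(encoded.encode()).decode()
```

URL-safe base64 avoids the / and + characters of the standard alphabet, so any ID becomes a single valid path segment.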
Hive partitioning everywhere: Bronze API data is partitioned by year=/month=/day=, and the Silver and Gold fact tables are partitioned by year=/month=. This makes time-range queries efficient (Polars only reads the partitions it needs) and keeps individual Parquet files at a reasonable size.
Immutable Bronze, rebuildable Silver and Gold: Bronze is append-only; we never modify or delete Bronze files. Silver is fully derived from Bronze and can be rebuilt at any time by re-running the transform, and Gold is fully derived from Silver. This means if we add a new enrichment source or fix a bug in the transform logic, we just re-run from that layer forward; no data migration is needed.
Single configuration object: the Settings class (Pydantic-based) handles all environment differences. DEV mode uses Azurite (a local Azure Storage emulator), LOCAL mode uses Azure CLI credentials, and PROD mode uses managed identity and fetches Spotify secrets from the Key Vault. Downstream code never knows which mode it is running in; it just gets storage options and credentials from Settings.
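A sketch of that pattern, using plain Pydantic rather than the project's actual class (field names and option keys here are illustrative):

```python
from pydantic import BaseModel

class Settings(BaseModel):
    """Mode-aware configuration; downstream code only calls storage_options()."""
    mode: str = "LOCAL"  # DEV | LOCAL | PROD
    storage_account: str = "mylakehouse"  # placeholder account name

    def storage_options(self) -> dict:
        if self.mode == "DEV":
            # Azurite accepts the well-known development connection string
            return {"connection_string": "UseDevelopmentStorage=true"}
        # LOCAL resolves Azure CLI credentials and PROD the managed
        # identity, both via DefaultAzureCredential underneath adlfs
        return {"account_name": self.storage_account, "anon": False}
```

The payoff is that a pipeline can pass Settings.storage_options() straight into pl.scan_parquet and run unmodified against Azurite, a developer laptop, or the Container Apps job.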
V. What's Next
The lakehouse is running, the data flows in hourly, and the Gold layer has enough pre-computed aggregations to answer most questions we care about now. But the analytics layer on top (dashboards, visualizations, deeper analysis) is still to come and will be a separate project. The point was never just the dashboard; it was proving that, with a bit of engineering, we can build something that gives more insight into our own listening habits than the company that has all the data. Spotify Wrapped tells us the top 5 artists of last year, but this tells us how our taste and mood shifted across a decade of listening, which genres we gravitate to on weekends vs. weekdays, the skip rate by artist, and whether we actually finish the songs we start.
The data was always there, Spotify just was not going to give us the insights that matter.