AI Foundation Series: The Ghost in the Data: My First 17,000 Houses

In my first post in this foundation series, I learned that to an AI, everything is a number, or more precisely, a “vector.” This week, I scaled up: I went from looking at a single house to looking at 17,000 houses at once.

To handle this, I moved away from simple math and met Pandas, the industry-standard library for data manipulation. If AI is the engine, Pandas is the fuel refiner.

Meeting the Crowd

Using Google Colab, I loaded the “California Housing” dataset. It’s a famous collection of data points from the 1990 census. When you load 17,000 rows, you aren’t just looking at data; you’re looking at a digital map of human life.
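For anyone following along, the loading step can be sketched roughly as below. It assumes the copy of the dataset that Colab ships in its sample_data folder (the path is an assumption about your runtime), and it falls back to three made-up rows so the snippet still runs elsewhere. The helper name load_housing is hypothetical.

```python
from pathlib import Path

import pandas as pd

# Assumed location of the sample file inside a Colab runtime.
COLAB_CSV = Path("/content/sample_data/california_housing_train.csv")


def load_housing(csv_path=COLAB_CSV):
    """Load the housing CSV, or a tiny made-up stand-in if the file is missing."""
    csv_path = Path(csv_path)
    if csv_path.exists():
        return pd.read_csv(csv_path)
    # Fallback: three invented rows with matching column names, for illustration only.
    return pd.DataFrame(
        {
            "longitude": [-122.4, -118.3, -119.8],
            "latitude": [37.8, 34.1, 36.7],
            "total_rooms": [1500.0, 2200.0, 900.0],
            "population": [800.0, 1300.0, 450.0],
            "median_house_value": [450000.0, 380000.0, 95000.0],
        }
    )


df = load_housing()
print(df.shape)   # (rows, columns)
print(df.head())  # first five rows
```

On the real file, df.shape is the quickest way to confirm that all of the rows actually arrived.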

The “Data Detective” Phase

One thing they don’t tell you in AI hype videos is that data is messy. I ran a statistical “health check” using df.describe(), and that’s when I found the “ghosts.”

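In any massive dataset, there are outliers: data points that make no sense at first glance. I found “houses” that seemed to have 30,000 rooms but only a handful of people living in them. (Strictly speaking, each row in this dataset summarizes a census block group rather than a single house, so total_rooms is a block-wide total. A row with that many rooms and almost no residents is still suspicious.) As an AI builder, my job isn’t just to write code; it’s to be a detective. If I let the AI “study” these impossible houses, it will learn a distorted version of reality.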
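Here is a minimal sketch of that detective work, using a few made-up rows in place of the real data. The rooms-per-person ratio and the cutoff of 50 are my own illustrative assumptions, not a standard rule; the right threshold depends on what you consider plausible.

```python
import pandas as pd

# Made-up rows for illustration; the last one mimics a "ghost".
df = pd.DataFrame(
    {
        "total_rooms": [1500.0, 2200.0, 900.0, 30000.0],
        "population": [800.0, 1300.0, 450.0, 12.0],
    }
)

# Statistical health check: count, mean, std, min, quartiles, and max per column.
print(df.describe())

# Rooms per resident for each row.
rooms_per_person = df["total_rooms"] / df["population"]

# Arbitrary cutoff (an assumption): rows above it get flagged for inspection.
MAX_ROOMS_PER_PERSON = 50
ghosts = df[rooms_per_person > MAX_ROOMS_PER_PERSON]
clean = df[rooms_per_person <= MAX_ROOMS_PER_PERSON]

print(ghosts)
```

Flagging the rows first, instead of deleting them outright, leaves room to look at each “ghost” before deciding whether it is an error or just an unusual block.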

image

Visualizing the Wealth Gap

The highlight of this session was turning those 17,000 rows into a visual story. By plotting longitude and latitude and coloring the dots by house value, a heatmap of California appeared right in my notebook.

You could see the “heat” (high prices) hugging the coastline of San Francisco and Los Angeles, while the inland areas remained “cool.” It was a powerful reminder: AI doesn’t just predict; it reveals patterns that are already there.
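A plot along these lines can be sketched with Matplotlib as follows. The dot size uses the s=df["population"]/100 formula quoted in this post; the colormap, transparency, figure size, and the helper name plot_price_map are assumptions for illustration, and the three sample rows are made up.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_price_map(df):
    """Scatter longitude vs. latitude, colored by value and sized by population."""
    fig, ax = plt.subplots(figsize=(8, 6))
    points = ax.scatter(
        df["longitude"],
        df["latitude"],
        s=df["population"] / 100,    # dot size scales with population
        c=df["median_house_value"],  # dot color encodes house value
        cmap="jet",                  # assumed colormap: blue = low, red = high
        alpha=0.4,                   # transparency keeps overlapping dots readable
    )
    fig.colorbar(points, ax=ax, label="median_house_value")
    ax.set_xlabel("longitude")
    ax.set_ylabel("latitude")
    return ax


# Made-up rows for illustration; swap in the real DataFrame.
sample = pd.DataFrame(
    {
        "longitude": [-122.4, -118.3, -119.8],
        "latitude": [37.8, 34.1, 36.7],
        "population": [800.0, 1300.0, 450.0],
        "median_house_value": [450000.0, 380000.0, 95000.0],
    }
)
ax = plot_price_map(sample)
```

In a notebook the figure renders at the end of the cell; in a plain script, add plt.show() after the call.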

image

The Coastal Burn: The “red” dots (expensive houses) form a tight line along the coast. To an AI, this means the longitude and latitude variables are incredibly powerful—maybe more than the number of rooms.

The Population Clusters: The size of the dots (s=df["population"]/100) shows where people are crammed together. My expectation is that where the dots are biggest and reddest, the AI will have the most “difficulty” predicting prices, because prices there vary the most. That is something to verify once a model is actually trained.

The Inland Cool: Notice the blue/green “cool” zones in the middle. This is where your AI model will likely be most “accurate” because the price swings are less extreme.

Painting with Data

The most powerful moment of my week wasn’t writing the code; it was seeing the ‘heat’ appear. By plotting 17,000 coordinates, I watched a heatmap of California emerge from a void of numbers. You could literally see the wealth ‘hugging’ the coastline of Los Angeles and San Francisco in bright red, while the inland areas remained a cool, affordable blue. It was a stark reminder: AI doesn’t just predict; it reveals patterns that are already there. As a builder, I realized that if I don’t help the AI understand ‘location,’ it will never truly understand ‘value.’

GitHub Repo: https://github.com/ankitsrivastava/ai-foundation-series/blob/main/AI_foundation_Series.ipynb?short_path=5388780
