Apr 5, 2022
Long story short: I’ve taken a new position in Luxembourg and I have to
find an apartment, in a different country, in a reasonably short time.
TL;DR: I hate apartment hunting, so I’ve tried to make it as interesting
as possible by building a pipeline of different tools and languages:
scraping data from a website with Puppeteer, loading it into
a SQLite database using Go, and visualizing it all with Python and Folium.
If you want to look at the final code or follow along with me,
you can check out the project repo on
GitHub.
I initially sketched what I was looking for, by priority.
After that it was time for the boring stuff: looking for an apartment
that best fits my requirements.
I have a couple of friends in Lux and they all told me to look on the infamous
website AtHome.lu to get an idea of the apartments
available in Lux, so I did.
I don’t like doing this, so I try to make things easier by just
looking at the pictures, and if the price looks OK I’ll just bookmark
the listing so that I can come back to it later for comparisons.
This quickly becomes hard to do. Some agencies publish only part of the
monthly price, while others post the actual monthly price. Each
agency fee is different and it’s not immediately visible. The map
on the website is just awful and won’t let me compare an apartment’s position
with the others. Needless to say, it’s pretty much impossible to choose
the right one with such messy data.
What I’d like to have is a big map with all the apartments that I like on
it and maybe a database to query apartments to show by different parameters.
Let’s see if we can scrape some data off of athome.lu!
After taking a quick look at the website I found out that, when selecting
apartments for rent in the Luxembourg area, links start with the URL path
www.athome.lu/en/rent/apartment/luxembourg. Each apartment
has a unique id, and the complete address for an apartment looks like this:
www.athome.lu/en/rent/apartment/luxembourg/id-642398588.html.
This is a good start, it means that I can pretty much navigate to a specific apartment
page just by knowing its id.
If I inspect the HTML source of a single page, I immediately notice that there
is a HUGE JSON object with a lot of data in it at the bottom of the page.
Let’s see if we can find something interesting in it.
Cool, it seems to be a big object used to set up the entire page. If we look at
each sub-object of __INITIAL_STATE__
we find detail,
which
seems to be our lucky card.
Great! We don’t even have to scrape data from the HTML: all the apartment
data is right there in this __INITIAL_STATE__.detail
object!
How can I access that variable through code, though? I don’t have much experience with this
and I don’t write a lot of JS, but I’ve heard that Puppeteer
is the right tool for the job. Let me try something out:
const puppeteer = require('puppeteer');
const fs = require('fs').promises;

async function scrape_json_data(browser, link) {
  const page = await browser.newPage();
  await page.goto(link);
  // Evaluate in the page context to grab the global state object
  const obj = await page.evaluate(() => {
    var { detail } = __INITIAL_STATE__;
    return detail;
  });
  return obj;
}

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  // File where I'll save all my preferred apartments' links
  const data = await fs.readFile(process.env.LINKS_PATH, "utf-8");
  // Read line by line
  const links = data.split(/\r?\n/);
  var objs = [];
  for (var i in links) {
    let id_pattern = /id-(\d+)/;
    // Take id from final part of each link
    let id = links[i].match(id_pattern)[1];
    console.log("scraping: " + id);
    var obj = await scrape_json_data(browser, links[i]);
    if (obj.found) {
      // Remove all the superfluous data from detail obj
      obj = clean_obj(obj);
      // Add link to the obj, somehow it's not included in detail obj :/
      obj.link = links[i];
      objs.push(obj);
    } else {
      objs.push({ found: false, listingId: parseInt(id) });
    }
  }
  // Save obj data to json file
  await fs.writeFile(process.env.JSON_OUT, JSON.stringify(objs));
  await browser.close();
})();
The code above will open a headless instance of Chrome, go to each
apartment link saved in the file at LINKS_PATH,
and dump an array with all the apartment data into a file at JSON_OUT.
We were lucky this time: we didn’t have to scrape the HTML,
which would probably have been the most boring part of the entire process.
The next steps will be about storing the data in a database and visualizing it,
but first let’s write a justfile
(an alternative to a Makefile) that will make our life easier when we
need to execute commands.
base := justfile_directory()
json_out := "/tmp/res.json"
links := base + "/homes.txt"

scrape:
    LINKS_PATH={{links}} JSON_OUT={{json_out}} node scraper/main.js

(Each recipe line runs in its own shell, so the environment variables have to be
on the same line as the node command.)
I can now scrape data by just typing just scrape.
I want to save all the data to a SQLite database
so that I can conveniently check, query, and retrieve
apartment info whenever I want and however I want.
Let’s move away from JS and switch to a compiled language:
Go fits perfectly here, it’s fast and easy to use.
The binary will parse the entire JSON file that the scraper created
and load each apartment into the apartment
table in SQLite.
I didn’t show it before, but this is my final, cleaned-from-useless-stuff
Apartment
struct, with tag annotations to read from JSON and load into
SQLite using sqlx.
type Apartment struct {
Found bool `json:"found,omitempty" db:"found,omitempty"`
ListingId uint32 `json:"listingId,omitempty" db:"listingId,omitempty"`
ListingAgencyReference string `json:"listingAgencyReference,omitempty" db:"listingAgencyReference,omitempty"`
IsSoldProperty bool `json:"isSoldProperty,omitempty" db:"isSoldProperty,omitempty"`
Region string `json:"region,omitempty" db:"region,omitempty"`
CityName string `json:"cityName,omitempty" db:"cityName,omitempty"`
Lon float64 `json:"lon,omitempty" db:"lon,omitempty"`
Lat float64 `json:"lat,omitempty" db:"lat,omitempty"`
Price int `json:"price,omitempty" db:"price,omitempty"`
ChargesPrice int `json:"chargesPrice,omitempty" db:"chargesPrice,omitempty"`
Caution float32 `json:"caution,omitempty" db:"caution,omitempty"`
AgencyFee string `json:"agency_fee,omitempty" db:"agency_fee,omitempty"`
PropertySubType string `json:"propertySubType,omitempty" db:"propertySubType,omitempty"`
PublisherId int `json:"publisher_id,omitempty" db:"publisher_id,omitempty"`
PublisherRemoteVisit bool `json:"publisher_remote_visit,omitempty" db:"publisher_remote_visit,omitempty"`
PublisherPhone string `json:"publisher_phone,omitempty" db:"publisher_phone,omitempty"`
PublisherName string `json:"publisher_name,omitempty" db:"publisher_name,omitempty"`
PublisherAthomeId string `json:"publisher_athome_id,omitempty" db:"publisher_athome_id,omitempty"`
PropertySurface float64 `json:"propertySurface,omitempty" db:"propertySurface,omitempty"`
BuildingYear string `json:"buildingYear,omitempty" db:"buildingYear,omitempty"`
FloorNumber string `json:"floorNumber,omitempty" db:"floorNumber,omitempty"`
BathroomsCount int `json:"bathroomsCount,omitempty" db:"bathroomsCount,omitempty"`
BedroomsCount int `json:"bedroomsCount,omitempty" db:"bedroomsCount,omitempty"`
BalconiesCount int `json:"balconiesCou