Scraping over 100K GTA 5 mods

19/09/2023

TypeScript

Svelte

NodeJS

Grand Theft Auto V, one of the most popular games ever, has a vibrant modding scene. However, most sites that host mods are slow or have poor search functionality. On top of that, not every mod author uploads their mods to every site. So I thought, why not make something myself?



Webscraping

To make a site that lists mods, I need, well, a list of mods. To get this list I scraped four of the most popular modding websites. I used NodeJS with TypeScript and the X-ray npm package, which lets me extract data from the HTML structure of the sites I’m scraping. Doing this I found over 120K mods, but there is a problem: some mod creators upload their mods to multiple sites, which leaves duplicate mods in my list, something I obviously don’t want. To combat this I merged mods that share both a mod name and an author name. After the great merge I “only” had 110K mods left.
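The merge step could look roughly like the sketch below. This is not the project's actual code: the `ScrapedMod` shape, the `modKey` and `mergeDuplicates` helpers, the normalization (lowercasing, collapsing whitespace) and the choice to sum download counts are all assumptions made for illustration.

```typescript
// Hypothetical shape of a scraped mod; the real project's fields may differ.
interface ScrapedMod {
	title: string;
	author: string;
	downloads: number;
	sources: string[]; // URLs of the sites this mod was found on
}

// Build a comparison key from mod name and author name.
// Lowercasing and collapsing whitespace is an assumption, not a confirmed rule.
function modKey(mod: ScrapedMod): string {
	const normalize = (s: string) => s.trim().toLowerCase().replace(/\s+/g, ' ');
	return `${normalize(mod.title)}|${normalize(mod.author)}`;
}

// Merge mods that share a key, combining their sources and download counts.
function mergeDuplicates(mods: ScrapedMod[]): ScrapedMod[] {
	const merged = new Map<string, ScrapedMod>();
	for (const mod of mods) {
		const key = modKey(mod);
		const existing = merged.get(key);
		if (existing) {
			existing.sources.push(...mod.sources);
			existing.downloads += mod.downloads;
		} else {
			// Copy the entry so the input array is left untouched
			merged.set(key, { ...mod, sources: [...mod.sources] });
		}
	}
	return Array.from(merged.values());
}

// Made-up sample data: the first two entries are the same mod on two sites
const scraped: ScrapedMod[] = [
	{ title: 'Simple Trainer', author: 'alice', downloads: 100, sources: ['https://site-a.example/1'] },
	{ title: 'simple  trainer', author: 'Alice', downloads: 50, sources: ['https://site-b.example/9'] },
	{ title: 'Car Pack', author: 'bob', downloads: 10, sources: ['https://site-a.example/2'] },
];
const merged = mergeDuplicates(scraped);
console.log(merged.length); // 2
```

Keying a `Map` on the combined name keeps the merge to a single pass over the list, which matters with over 120K entries.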



Search API

I now had a nice list of mods but no good way to search through them. So the next step was making an API that returns a set of mods for a given search query. To do this I used Express with TypeScript. The initial API looked something like this:


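app.get('/search', (req: Request, res: Response) => {
	// req.query.q may be missing or not a plain string, so check before splitting
	const query = req.query.q;
	if (typeof query !== 'string') {
		res.json([]); // no usable query, so return no results
		return;
	}
	const searchKeywords = query.split(' ');
	res.json(Search(searchKeywords));
});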

Now the Search function (this could be a lot better, but it works). It loops through all the mods and computes a rating for each of them; if the rating is greater than the magic MIN_RATING number, the mod is added to the results.


function Search(searchKeywords: string[]): Mod[] {
	const foundMods: Mod[] = [];

	// for...of yields the mods themselves (for...in would yield array indexes)
	for (const mod of allMods) {
		const rating = RateMod(mod, searchKeywords);
		if (rating > MIN_RATING) foundMods.push(mod);
	}

	return foundMods;
}
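Search as written returns matches in whatever order they appear in the list. One possible refinement, which is my suggestion and not something the original post describes, is to keep each rating and sort best-first. The generic `rankByRating` helper below is hypothetical; it takes any rating function, so it would work with RateMod too.

```typescript
// Score every item, drop those at or below minRating, and sort best-first.
function rankByRating<T>(items: T[], rate: (item: T) => number, minRating: number): T[] {
	return items
		.map((item) => ({ item, rating: rate(item) }))
		.filter((entry) => entry.rating > minRating)
		.sort((a, b) => b.rating - a.rating)
		.map((entry) => entry.item);
}

// Toy usage: strings scored by their length (purely illustrative)
const ranked = rankByRating(['a', 'abcd', 'abc', 'ab'], (s) => s.length, 1);
console.log(ranked); // [ 'abcd', 'abc', 'ab' ]
```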

Then the RateMod function is quite complicated, so I’ve simplified it here:


function RateMod(mod: Mod, searchKeywords: string[]) {
	let rating = 0;

	// If the title includes the exact keywords
	if (mod.title.includes(searchKeywords.join(' '))) rating += 10;

	// For every word check if it's in the title
	// (for...of yields the words; for...in would yield their indexes)
	for (const word of searchKeywords) {
		// If it is, add a point
		if (mod.title.includes(word)) rating += 1;
	}

	// Make the total downloads of the mod influence the rating
	// Mods with higher amounts of downloads are usually better
	rating += mod.downloads * 0.001;

	return rating;
}
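To see how these rules add up, here is a small worked example. It mirrors the simplified RateMod from this post (not the full one), and the `Mod` shape, the lowercase `rateMod` name and the sample mods are made up for illustration.

```typescript
// Minimal Mod shape for this example; the real type has more fields.
interface Mod {
	title: string;
	downloads: number;
}

// Mirrors the simplified scoring rules: exact phrase, per-word hits, downloads
function rateMod(mod: Mod, searchKeywords: string[]): number {
	let rating = 0;
	if (mod.title.includes(searchKeywords.join(' '))) rating += 10;
	for (const word of searchKeywords) {
		if (mod.title.includes(word)) rating += 1;
	}
	rating += mod.downloads * 0.001;
	return rating;
}

const keywords = ['Simple', 'Trainer'];

// Exact phrase (+10), both words (+2), 5000 downloads (+5)
const exact = rateMod({ title: 'Simple Trainer V', downloads: 5000 }, keywords);
// Only one word matches (+1), no downloads
const partial = rateMod({ title: 'Trainer Pack', downloads: 0 }, keywords);
// No keyword match at all, but 20000 downloads (+20)
const popular = rateMod({ title: 'Car Pack', downloads: 20000 }, keywords);

console.log(exact, partial, popular);
```

In this simplified form a very popular mod can outscore an exact title match without matching any keyword, which is presumably one of the things the full function and MIN_RATING have to balance.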

That’s not all of the RateMod function, but it’s a good overview to understand it a bit better. I also added the functionality to search by author name by using @username. That looks like this:


// Check if there is any keyword that starts with '@'
const authorKeyword = searchKeywords.find((word) => word.startsWith('@'));

if (authorKeyword) {
	// Strip the leading '@' so only the name itself is compared
	const author = authorKeyword.slice(1);
	// If the mod is not by the requested author we can simply
	// return, as we only want to show mods by this author
	if (!mod.author.includes(author)) return false;
}
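Another way to handle this, sketched below under my own assumptions, is to split the query once into an author filter and the remaining title keywords, so the @username word is not also matched against mod titles. The `ParsedQuery` type and `parseQuery` function are hypothetical names, not part of the project.

```typescript
interface ParsedQuery {
	author: string | null; // author name without the leading '@', if any
	keywords: string[]; // remaining words, used for title matching
}

// Split a raw query into an optional author filter and title keywords.
function parseQuery(query: string): ParsedQuery {
	const words = query.split(/\s+/).filter((word) => word.length > 0);
	let author: string | null = null;
	const keywords: string[] = [];

	for (const word of words) {
		// Only the first '@name' word is treated as the author filter
		if (author === null && word.startsWith('@') && word.length > 1) {
			author = word.slice(1);
		} else {
			keywords.push(word);
		}
	}

	return { author, keywords };
}

const parsed = parseQuery('@boris  car pack');
console.log(parsed); // { author: 'boris', keywords: [ 'car', 'pack' ] }
```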


Interface

Now I have an API endpoint that returns a list of mods for a given query, nice. The next step is to create a (semi) good-looking site. My favorite frontend framework is currently SvelteKit, so that’s what I used for this project. For some components I used Flowbite Svelte, along with Tailwind CSS for the rest of the styling. I wanted a UI that’s simple and intuitive to use. I also wanted to show a lot of mods at once, because other mod sites don’t do that and require more clicking through navigation menus. While I may not be a professional graphic designer, I’d say I’ve achieved my goals with this UI.


[Image: Modscraper preview]



Final notes

Later, I added caching with Redis to improve performance. This project was definitely fun to make and it turned out to be actually useful (judging by the analytics, others agree).
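The post doesn't show how the caching is wired up, so the following is only a sketch of the common cache-aside pattern. The `CacheStore` interface, the `InMemoryStore` stand-in (used so the example runs without a Redis server), the `cachedSearch` helper, the `search:` key prefix and the 300 second expiry are all assumptions. With a real Redis client, `get` and `set` would map onto the GET command and SET with an expiry.

```typescript
// Minimal key-value interface that a Redis client could sit behind
interface CacheStore {
	get(key: string): Promise<string | null>;
	set(key: string, value: string, ttlSeconds: number): Promise<void>;
}

// In-memory stand-in for Redis, for illustration and local testing only
class InMemoryStore implements CacheStore {
	private entries = new Map<string, { value: string; expiresAt: number }>();

	async get(key: string): Promise<string | null> {
		const entry = this.entries.get(key);
		if (!entry || entry.expiresAt <= Date.now()) return null;
		return entry.value;
	}

	async set(key: string, value: string, ttlSeconds: number): Promise<void> {
		this.entries.set(key, { value, expiresAt: Date.now() + ttlSeconds * 1000 });
	}
}

// Cache-aside: return the cached result if present, otherwise compute and store it
async function cachedSearch<T>(
	store: CacheStore,
	query: string,
	search: (query: string) => T,
	ttlSeconds = 300
): Promise<T> {
	const key = `search:${query}`;
	const hit = await store.get(key);
	if (hit !== null) return JSON.parse(hit) as T;

	const result = search(query);
	await store.set(key, JSON.stringify(result), ttlSeconds);
	return result;
}
```

Repeated queries then skip the loop over all 110K mods until the cached entry expires.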


That concludes this post, thanks for reading! If you’d like to check out the final product you can find it here: msw.boris.foo

Copyright © 2023-2024 boris.foo, All rights reserved.