Scraping over 100K GTA 5 mods
19/09/2023
Typescript
Svelte
NodeJS
Grand Theft Auto V, one of the most popular games ever, has a vibrant modding scene. However, most sites that host mods are slow or have poor search functionality. On top of that, not every mod author uploads their mods to every site. So I thought, why not make something myself?
Web scraping
To make a site that lists mods, I first need a list of mods. To get this list, I scraped four of the most popular modding websites. I used Node.js with TypeScript and the x-ray npm package, which let me pull data out of the HTML structure of the sites I was scraping. This turned up over 120K mods, but there was a problem: some mod creators upload their mods to multiple sites, which left duplicate mods in my list. To combat this, I merged mods by mod name and author name. After the great merge I “only” had 110K mods left.
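As a rough sketch of what such a merge can look like (not the exact code from the project): the ScrapedMod shape, the normalize helper, and the choice to sum download counts across sites are all assumptions made for illustration.

```typescript
// Hypothetical shape of a scraped mod; the real project's fields may differ.
interface ScrapedMod {
  title: string;
  author: string;
  downloads: number;
  sources: string[]; // sites the mod was found on
}

// Normalize a name so differences in case or spacing don't block a merge.
function normalize(name: string): string {
  return name.trim().toLowerCase().replace(/\s+/g, ' ');
}

// Merge entries that share the same normalized title and author.
function mergeDuplicates(mods: ScrapedMod[]): ScrapedMod[] {
  const merged = new Map<string, ScrapedMod>();
  for (const mod of mods) {
    const key = normalize(mod.title) + '\u0000' + normalize(mod.author);
    const existing = merged.get(key);
    if (existing) {
      // Assumption: downloads from different sites are simply added up.
      existing.downloads += mod.downloads;
      for (const source of mod.sources) {
        if (existing.sources.indexOf(source) === -1) existing.sources.push(source);
      }
    } else {
      // Copy the entry so the input array is left untouched.
      merged.set(key, { ...mod, sources: mod.sources.slice() });
    }
  }
  return Array.from(merged.values());
}
```

Keying on the normalized title plus author means "Real Cars" by "Bob" and " real  cars" by "BOB" collapse into one entry, while a mod with the same title by a different author stays separate.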
Search API
I now had a nice list of mods but no good way to search through them. So the next step was making an API that returns a set of mods for a given search query. To do this I used Express with TypeScript. The initial API looks something like this:
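import express, { Request, Response } from 'express';
const app = express();

app.get('/search', (req: Request, res: Response) => {
  // req.query.q is only a string when a single ?q=... is passed
  const query = typeof req.query.q === 'string' ? req.query.q : '';
  const searchKeywords = query.split(' ').filter(Boolean);
  const searchResults = Search(searchKeywords);
  res.json(searchResults);
});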
Now for the Search function (this could be a lot better, but it works). It loops through all the mods and computes a rating for each of them; if the rating is greater than the magic MIN_RATING number, the mod is added to the results.
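function Search(searchKeywords: string[]) {
  const foundMods: Mod[] = [];
  // for...of yields the mods themselves (for...in would yield their keys);
  // this assumes allMods is an array
  for (const mod of allMods) {
    const rating = RateMod(mod, searchKeywords);
    if (rating > MIN_RATING) foundMods.push(mod);
  }
  return foundMods;
}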
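Search returns matches in the order they appear in the list. One possible refinement, shown here only as a sketch and not as the project's actual code, is to keep each rating and sort by it so the best matches come first. The searchRanked name, the rate callback (standing in for RateMod), and the limit parameter are all hypothetical.

```typescript
// Generic so the sketch is self-contained; `rate` stands in for RateMod.
function searchRanked<T>(
  items: T[],
  rate: (item: T) => number,
  minRating: number,
  limit: number = 50, // hypothetical cap on the number of results
): T[] {
  const scored: { item: T; rating: number }[] = [];
  for (const item of items) {
    const rating = rate(item);
    // Same threshold check as in Search, but the rating is kept for sorting.
    if (rating > minRating) scored.push({ item, rating });
  }
  // Highest rating first.
  scored.sort((a, b) => b.rating - a.rating);
  return scored.slice(0, limit).map((entry) => entry.item);
}
```

Capping the result count also keeps the JSON response small when a broad query matches thousands of mods.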
The RateMod function is quite complicated, so I’ve simplified it here:
function RateMod(mod: Mod, searchKeywords: string[]) {
  let rating = 0;
  // If the title includes the exact keywords
  if (mod.title.includes(searchKeywords.join(' '))) rating += 10;
  // For every word check if it's in the title
  // (for...of yields the words; for...in would yield array indexes)
  for (const word of searchKeywords) {
    // If it is, add a point
    if (mod.title.includes(word)) rating += 1;
  }
  // Make the total downloads of the mod influence the rating
  // Mods with higher amounts of downloads are usually better
  rating += mod.downloads * 0.001;
  return rating;
}
That’s not all of the RateMod function, but it gives a good overview of how it works. I also added the ability to search by author name using @username. That looks like this:
// Check if there is any keyword that starts with '@'
const author = searchKeywords.find((word) => word.startsWith('@'));
if (author) {
  // If there is a requested author, we check if the mod
  // includes this author (without the leading '@'). If it
  // doesn't, we can simply return, as we only want to show
  // mods by this author
  if (!mod.author.includes(author.slice(1))) return false;
}
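An alternative way to handle the @username syntax, sketched here under my own assumptions rather than taken from the project, is to split the query into an author filter and plain keywords once, before any mod is rated. The ParsedQuery type and parseQuery function are hypothetical names, and lowercasing the input is an extra assumption (the snippets above match case-sensitively).

```typescript
// Hypothetical result of parsing a raw search string.
interface ParsedQuery {
  author: string | null; // requested author without the leading '@'
  keywords: string[];    // remaining search words
}

function parseQuery(query: string): ParsedQuery {
  // Split on any whitespace and drop empty pieces.
  const words = query.trim().split(/\s+/).filter((word) => word.length > 0);
  // The first token of the form @name is treated as the author filter.
  const authorToken = words.find((word) => word.startsWith('@') && word.length > 1);
  return {
    author: authorToken ? authorToken.slice(1).toLowerCase() : null,
    // Assumption: matching is meant to be case-insensitive.
    keywords: words
      .filter((word) => word !== authorToken)
      .map((word) => word.toLowerCase()),
  };
}
```

Parsing up front keeps the author token out of the title matching, so a query like "drift car @boris" rates titles against "drift" and "car" only.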
Interface
Now I have an API endpoint that returns a list of mods for a given query, nice. The next step is to create a (semi) good-looking site. My favorite frontend framework is currently SvelteKit, so that’s what I used for this project. For some components I used Flowbite Svelte, along with Tailwind CSS for the rest of the styling. I wanted a UI that’s simple and intuitive to use. I also wanted to list a lot of mods at once, because other mod sites don’t do that and require more clicking through navigation menus. While I may not be a professional graphic designer, I’d say I’ve achieved my goals with this UI.
Final notes
Later, I added caching with Redis to improve performance. This project was definitely fun to make, and it turned out to be actually useful (judging by the analytics, others agree).
That concludes this post, thanks for reading! If you’d like to check out the final product you can find it here: msw.boris.foo