Scrape HTML pages

Hello,

After touring the 24 record stores in Brooklyn and taking pics, I wanted to make an app. So I’ve started building one - https://recordstoresnyc.glideapp.io - and I’m pretty happy with it!

But now I want to complicate things (obviously) and add information from the Discogs profiles of the stores that have one.

So far, I’ve made it retrieve the first page of listings from a profile page with =IMPORTHTML("https://www.discogs.com/seller/FaceRecordsNYC/profile", "table", 3)

But now I’m facing some issues and questions:

  • how can I retrieve the image, name, and price of each record? At the moment =IMPORTHTML imports a field with more generic info that I don’t need. So, for the image, I’m trying the formula below, with no luck:

=IMPORTXML("https://www.discogs.com/seller/FaceRecordsNYC/profile", "//*[@id='pjax_container']/table/tbody/tr[1]/td[1]/a/img/data-src")

  • can I paginate to retrieve all pages of a profile and get all their records?

Basically, what I need is: if a record store has a Discogs account, I want to display its list of records with image and price, scraped from its Discogs page.

Sounds easy! But apparently not for me :joy: Any help would be highly appreciated!

For pagination, you could modify the URL to grab 250 results per page and run multiple queries - maybe 10 - to retrieve them all.

https://www.discogs.com/seller/FaceRecordsNYC/profile?sort=listed%2Cdesc&limit=250&page=1

Very cool, I like the pictures you took.

They do have an API that gives you direct access to the data, but you should check their terms of service and policies to see what is and isn’t allowed. With the API you can get everything you need.

That said, I have used Google Apps Script to retrieve, add, and update data in Mailchimp via their API, and it was quite a challenge for me. So for someone without programming experience it could be significantly harder to pull off.
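If you do go the API route, a minimal Apps Script sketch might look something like this. The /users/{username}/inventory endpoint is from the Discogs developer docs, but the listing field names used here (release.description, price.value, release.thumbnail) are assumptions you’d want to verify against the docs before relying on them:

```javascript
// Sketch only - build the inventory URL for one page of a seller's listings.
function buildInventoryUrl(username, page, perPage) {
  return 'https://api.discogs.com/users/' + encodeURIComponent(username) +
    '/inventory?page=' + page + '&per_page=' + perPage;
}

function fetchInventory() {
  var url = buildInventoryUrl('FaceRecordsNYC', 1, 100);
  // Discogs rejects requests that don't send a User-Agent header.
  var response = UrlFetchApp.fetch(url, {
    headers: { 'User-Agent': 'RecordStoresNYC/1.0' }
  });
  var data = JSON.parse(response.getContentText());
  // One row per listing: title, price, thumbnail URL (field names are assumptions).
  var rows = data.listings.map(function (listing) {
    return [listing.release.description, listing.price.value, listing.release.thumbnail];
  });
  SpreadsheetApp.getActiveSheet()
    .getRange(2, 1, rows.length, 3)
    .setValues(rows);
}
```

You’d paste this into the sheet’s Script Editor and run fetchInventory(); pagination would mean looping buildInventoryUrl over the page numbers the same way as with the URL trick above.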



Ah that’s cool! I hadn’t noticed I could get 250 results, thanks. But how would I set up multiple queries in a sheet?

Here is a link to their API if you are interested in exploring that route.
https://www.discogs.com/developers

Hi George! Thanks.

I know about their API, but not being a developer I’m having a hard time understanding how to integrate it, so I was looking into a “simpler” solution with IMPORTHTML/IMPORTXML or any other trick. I was thinking of using https://parabola.io/ to make it easier, but I haven’t looked at it in detail yet.

I was thinking of just putting an IMPORTHTML query every 250 rows or so, changing the page number for each query. Didn’t put a whole lot of thought into it. :wink:
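That pattern could look something like this (a sketch, assuming the listings are still table index 3 as in your first formula, and that each query’s 250 rows don’t collide with the next one - you’d place the second formula far enough down the sheet):

```
=IMPORTHTML("https://www.discogs.com/seller/FaceRecordsNYC/profile?sort=listed%2Cdesc&limit=250&page=1", "table", 3)
=IMPORTHTML("https://www.discogs.com/seller/FaceRecordsNYC/profile?sort=listed%2Cdesc&limit=250&page=2", "table", 3)
```

…and so on, bumping page= for each query until a page comes back empty.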

Hi @mrdobelina, I think the best way for you is to use parabola.io, but you can also try something like:

  • Simple Scraper

  • Loop

  • IMPORTJSON (it’s a custom script)

  • IMPORTXML (I’ve talked about it in this post)

Hope it helps you, let me know if you need help with any of these tools/tips.


Awesome, thank you @ionamol! Lots of things to read and learn. Will check and let you know if I have any questions :pray:
