In this video, I'm going to demonstrate how to scrape web data into your Bubble application, and I'm going to be using the web scraper Page2API. I was working on a recent client project and tried a number of different web scraper APIs, and I found that Page2API offered the best integration for what I was trying to do with the Bubble API Connector plugin.
Understanding API documentation
So that's what I'll be demonstrating to you now. If we head into the Bubble API Connector (install this plugin by Bubble if you haven't already), we'll add another API. This is Page2API, and then we're going to add a call. For the purposes of this demonstration, we will be scraping the H1 tag.
This is the HTML tag that identifies the most important heading on the webpage, so it is quite a common target for web scraping. We're just going to call it 'scrape H1'. And then we have to dig into the Page2API documentation in order to know how to fill out our API call here.
If you are looking at API documentation, personally I find the easiest section to translate into Bubble is the cURL section. Not Ruby, not Python: cURL is the easiest one to translate into Bubble.
Bubble API Connector
So I need to make a POST call. In the header, I have to make this declaration here: 'Content-Type: application/json'. Okay. And then this little '-d' tells me that the rest of the content here is to be sent as data, but you could also think of it as going in the body of the call. So I have to make that. Okay. So it's a POST, and we make the call to this address here.
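To make the mapping from the cURL snippet to the API Connector concrete, here is a minimal sketch in Python of the same three pieces: the method, the Content-Type header, and the JSON body. The field names ('api_key', 'url', 'parse') and the selector syntax are assumptions based on my reading of the Page2API docs, so check the current documentation for the exact body the API expects.

```python
import json

# The three pieces of the cURL snippet, as the API Connector sees them:
# method POST, one -H header, and the -d data (the call's body).
headers = {"Content-Type": "application/json"}  # the -H header

body = {                                    # the -d data / JSON body
    "api_key": "YOUR_API_KEY",              # placeholder, not a real key
    "url": "https://www.bbc.co.uk",         # the page to scrape
    "parse": {"title_html": "h1 >> html"},  # label -> selector (assumed syntax)
}

payload = json.dumps(body)  # what actually travels in the POST body
print(payload)
```

In Bubble you paste the body text straight into the call rather than building it in code, but the structure is the same.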
Then there are a few other things we need to add in. Oh, before we do that: this is a common mistake I see all the time when I'm making API integrations. Swap this from 'Data' to 'Action'. 'Use as Data' allows you to pull in information, a little bit like a 'Do a search for'; so if you wanted to populate a dropdown with a list of time zones, you would use 'Data' here. (Sorry, 'Use as Data', that's a bit confusing for the list-of-time-zones example.) But 'Action' enables you to make this call in a workflow, which is what we want to do: we want to make the call and then save the result to our database.
Now for the other parts that we need here. I'm just going to copy and paste this into the body of my API call, and because it's taken from an example, there's a lot here that we don't need and that we need to edit in order to make it work. We can make the URL a dynamic value by using angle brackets, so <url>, and then we want to untick 'Private' because we want to be able to insert a value in here in our API call in the workflow.
And then we just want to target the H1. I'm going to delete these other lines here, making sure that I don't leave a stray comma at the end, otherwise the JSON will be invalid. Oh, and we need to put a valid URL in here. Let's give it a test.
So there's something wrong with my JSON: I've not got a closing bracket. You can see that you have this parent bracket here, and I have one down here, but actually the 'parse' content has its own. So let's lay it out really neatly: I need a bracket down here. Let's try that. Okay. Web scraping takes a few moments, but there we go.
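Both of the mistakes above (the stray trailing comma and the missing closing bracket) are easy to reproduce outside Bubble. A quick sketch with Python's json module shows that either one makes the body invalid JSON:

```python
import json

valid = '{"url": "https://example.com", "parse": {"page_h1": "h1"}}'
stray_comma = '{"url": "https://example.com", "parse": {"page_h1": "h1"},}'
missing_brace = '{"url": "https://example.com", "parse": {"page_h1": "h1"}'

json.loads(valid)  # parses fine

# Both broken variants raise JSONDecodeError rather than parsing.
for broken in (stray_comma, missing_brace):
    try:
        json.loads(broken)
        print("unexpectedly parsed")
    except json.JSONDecodeError as err:
        print("Invalid JSON:", err.msg)
```

Pasting your call body through a validator like this (or any online JSON linter) before initializing the call in Bubble saves a round trip to the API.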
So you can see here that it has returned 'title_html', which is the label that I assigned in the body of the JSON, but it's returned the whole H1 portion of HTML. Now there's a way to get around that. If I go back into the documentation and have a look for data extraction, you'll see that I can tell Page2API what sort of data I want to get back.
I just want to get back text, so I can go back in here and add in that expression there. That's the label that it returns, so in fact we will rename it to page_h1, and then let's initialize the call. Like I was saying, this takes a few seconds to complete.
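As a rough sketch of that change, the two parse specs differ only in the label you choose and the extractor on the end of the selector. The '>> html' / '>> text' syntax here is my reading of the Page2API docs and may differ from the current API, so verify it there:

```python
# Label -> selector pairs as they'd appear in the call body.
# An html extractor returns the whole element; text returns just the inner text.
parse_html = {"title_html": "h1 >> html"}  # e.g. "<h1 ...>BBC Homepage</h1>"
parse_text = {"page_h1": "h1 >> text"}     # e.g. "BBC Homepage"
```

The label ('title_html', 'page_h1') is whatever key you pick; it's also the key Bubble will show you when you initialize the call and map the response.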
And one reason for that is that I have the scraper using a real browser. There we go. So the request was made with a real browser, and the advantage of that is that, in my experience using web scrapers, most websites will block an attempt to scrape their content if it doesn't look like it's coming from a human or from a legitimate indexing bot, like Google's.
You can turn this off on most web scraping services, and by not using a real browser you'll save some money per request, but I've found that using one is just much more reliable. And there we go, I get the call back: 'BBC Homepage' is the H1 for that page, so my expression works. Now let me demonstrate how to add this into the design of your Bubble app.
So I've got a repeating group here which shows a list of websites, and I want to be able to add a URL in here, click 'Scrape', and have it added to my database. So let's do that. When the button is clicked: Plugins. Because I have it set as an Action, I see my API call here, and then I link this up to my input. Then I'm going to reset my input so I can place more than one call through very quickly. Lastly, and perhaps most importantly, I need to add it to my database. I have already created a data type called 'Website', and I'm going to save my H1 in here. This is the key bit: 'Result of step 1', and then looking for my label, which is page_h1.
Just so I know what website I've called, I also want to reference the URL, so that's the input's value. But as it stands I'd be referencing an empty input, because I put this step after my reset; I'll just pop it before the reset instead. Then let's give it a refresh and give it a try.
Web scraping limitations
So one of the things you'll realize with web scrapers is that they're not that clever; rather, you have to do a lot of the supporting work yourself, like providing them with a correct URL. So if I was building this into an app that other users were going to be using, I would find ways in the expression here to ensure that 'https://' et cetera is included, and to check that the URL is valid, otherwise it isn't going to work. So let's try it. Okay, there we go, it's come across. Let's try another one. In fact, let's try to demonstrate it not working, so let's put in a deliberate error. Okay, Udemy lives on www.udemy.com.
In that instance it worked, probably because Udemy has a redirect set up from the root domain to the www subdomain, but let's make an even more deliberate error.
Okay, there we go. So the API call is throwing up that there is an issue. So you'd want to find a way in your Bubble app of handling errors, perhaps nudging users with the input's placeholder text showing how to correctly enter a URL, to make the web scraping process as reliable as possible.
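To make that pre-flight check concrete, here's a minimal sketch of the kind of normalization you could run on a user-entered URL before handing it to the scrape call. The function name and rules are my own for illustration (in Bubble itself you'd express this with conditionals on the input's value), but the logic is the same:

```python
from urllib.parse import urlparse

def normalize_url(raw: str):
    """Best-effort cleanup of a user-entered URL; returns None if hopeless."""
    candidate = raw.strip()
    if not candidate:
        return None
    # Prepend a scheme if the user typed e.g. "bbc.co.uk"
    if not candidate.startswith(("http://", "https://")):
        candidate = "https://" + candidate
    # Require a host with at least one dot to catch obvious typos
    if "." not in urlparse(candidate).netloc:
        return None
    return candidate
```

With this in place, 'bbc.co.uk' becomes 'https://bbc.co.uk' and plain 'nonsense' is rejected before you spend a paid API request on it.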