Edge Browser 8
By Richard Rost

3 years ago

Get Page Text, HTML, Scrape Page w FindBetween

In this Microsoft Access tutorial, I'm going to show you how to use the new Edge Browser Control.

In Part 8, I'm going to show you how to get the text of a page that's in your web browser. I'll also show you how to get the HTML of the page, and then I'll show you how to use my FindBetween function to locate specific information on that page, like an account balance. That's called "scraping" a webpage.

Members

There is no extended cut, but here is the database download:

Database Download - Gold Members

Silver Members and up get access to view Extended Cut videos, when available. Gold Members can download the files from class plus get access to the Code Vault. If you're not a member, Join Today!

Prerequisites

Recommended Courses

Access Developer Courses

Javascript DOM References

Keywords

TechHelp Access, Edge Browser Control Access, Get Page Text Access, FindBetween function, Access Web Browser Automation, Web Scraping in Access, Extract Text Access Browser, HTML Retrieval, Access Query Web Page Text, Data Scraping Microsoft, Programmatic Web Browsing, Access VBA Page Text, Access Extract Webpage Content, Browser Control VBA, Automated Data Retrieval, Web Text Fetching Access.

Intro

In this video, I'll show you how to use the Microsoft Edge Browser Control in Microsoft Access to retrieve text and HTML from web pages, grab a page's title, and extract specific information using JavaScript. We'll build buttons to get the visible text and HTML content, and I'll talk about common issues like JavaScript capitalization and semicolons. You'll also learn how to extract data between two keywords using the FindBetween function. This is part 8.

Transcript

Welcome to another TechHelp video brought to you by AccessLearningZone.com. I'm your instructor, Richard Rost. Today, we're continuing with our Edge Browser Control series. This is part eight. Today, I'm going to show you how to get the page text from whatever page you browse to with the browser control. I'll show you how to get the page text, the HTML of the page, the title of the page, pretty much whatever you want off the page, you can grab.

How do we do this? Well, first up, before we get into it, this is part eight, so if you haven't watched parts one through seven, go watch those first.

Here we are back in our browser form. Let's go ahead and add a button that we can click on to get the text of the page, whatever page we happen to be on.

I want to grab all this text. There are any number of things you might want to do with it. You could grab some information off the page, like if you're going to your bank's website, you could find your account balance. I do that in my account balances database. I grab the pending amount, the actual balance amount; there's all kinds of stuff you can do. It's called scraping a web page. It's getting whatever text you want off that page. There are lots and lots of benefits to it, depending on what you're trying to do.

Let's go to design view now. I'm going to take the status box and make it smaller. We don't need that much of a text area for a status box. Let's copy and paste that guy. We'll put it right here. This will be where we'll put our page text. I'm going to open you up. Let's change this to "Page Text." You can put more in it than just page text, but it's just my text. I don't care.

Format. Let's give it a little bit of a different color. Maybe that, okay, looks good. Make it a wee bit smaller.

Let's grab a button. Copy the button, paste, slide it down here. We're going to call this guy "Get Page Text." Maybe make it a little bit smaller. Name of the button: GetPageTextButton.

So, we're going to get the text out of here and put it into me.

Right-click, Build Event. Oh, it brings up this thing. Why? Because I copied a button that had one of these event functions in it. So just come back over here to where it says Event and get rid of this guy.

Now you should be able to right-click, Build Event, and it'll bring up the code builder.

Here we are. Now, the first thing I'm going to do is blank the page text. So PageText equals empty string, just because if there's something in there from before, I want to clear it and start fresh.

Here's the command: PageText equals WB.RetrieveJavaScriptValue. I don't want to type that whole thing in so I'm just going to hit tab. It's RetrieveJavaScriptValue.

Now, if you know JavaScript, this can be pretty much anything from the web page's document object model: the body, the content, the title, the HTML, all that stuff. Again, I'm putting together a more in-depth JavaScript lesson that should be out hopefully later this year, but in the meantime, there are tons of good reference books and material online if you want to find out all the stuff you can grab from JavaScript.

I'm going to show you a couple of them today, but pretty much if it's in the web page and you want to get it, you can get it with your JavaScript.

I'm looking for the text content of the body of the document. It looks like this, in parentheses inside quotes: document.body.textContent
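Put together, the click event described so far might look something like the sketch below. It uses the names from this series (the browser control WB, the text box PageText, and the button GetPageTextButton); if your controls are named differently, adjust accordingly.

```vba
Private Sub GetPageTextButton_Click()

    ' Clear anything left over from a previous click.
    PageText = ""

    ' Ask the browser control to evaluate a JavaScript expression and
    ' return its value. The expression is passed as a plain string, so
    ' VBA will not fix its capitalization: textContent must have a
    ' lowercase t and an uppercase C.
    PageText = WB.RetrieveJavascriptValue("document.body.textContent")

End Sub
```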

JavaScript is case sensitive. This has to be a lowercase t, this has to be an uppercase C, or it won't work. That's not like VB where it properly capitalizes stuff for us, which I actually prefer. I used to be a C programmer and C is also case sensitive and it's annoying when you have different variables and you can't figure out why your code is not working.

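Save it, debug, compile. Come back over and yeah, let's close this, close this, open it back up again. Grab a page, any page, and then Get Page Text and there you go. There's the text of the page. It ignores formatting, ignores hyperlinks, ignores the HTML tags themselves. It's just the text. (Strictly speaking, textContent also returns text that isn't displayed, such as the contents of script and style elements and hidden elements; document.body.innerText is closer to what you actually see.)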

I go to this page, now if I hit Get Page Text, now it looks like you got nothing because there's a whole bunch of blank spaces up here. But if you come in here and you scroll down, there's all the text. Make sure you take a look at that and scroll down. There's all my stuff.

A couple things I want to mention. If this is not properly capitalized and you click the button, you get undefined. Make sure you properly capitalize it; that's very, very important. I can't tell you the number of hours I've lost of my life by trying to find errors with my JavaScript or my C code, and it's because I didn't have something capitalized properly.

There's a reason why I'm repeating this three times because it will happen to you. You'll thank me.

Next up, and this is a weird one. Sometimes you can get away with having a semicolon here because proper JavaScript wants a semicolon. Sometimes you can get away with it, sometimes you can't. Watch this: save it, back out here, go to a page, any page, hit it. Wait a second, it actually takes a second for this error to pop up. In the meantime, your database seems like it's not doing anything and it timed out: "Please verify the JavaScript expression supplied is valid." That's because if you do this, it doesn't like the semicolon. But sometimes it does. It's really inconsistent and it really drives me crazy sometimes because watch this: if you pick another one of the commands, let's say the back button command that we have, right? We've got history.back, put a semicolon after it. Close it, save it, open it, change it, and hit the back button, boom, and it works. It's weird.

Probably, in my opinion, because this is actually a document property, not an actual statement, whereas this is actually giving it a command. So I think usually it'll work with commands, although I've seen sometimes where it doesn't work with commands too. It's inconsistent and it drives me nuts and I don't know who to blame.

Just test it when you do it and see if it works.
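As a rough illustration of the two cases described above, the sketch below puts them side by side. The procedure name SemicolonDemo is made up for this example, and it assumes the back button from the earlier parts runs its command through the control's ExecuteJavascript method; the behavior noted in the comments is what the video reports, not a guarantee.

```vba
' Hypothetical demo procedure; not part of the class database.
Private Sub SemicolonDemo()

    ' An expression whose value is returned to VBA. In the video, adding
    ' a trailing semicolon here led to a timeout and the "Please verify
    ' the JavaScript expression supplied is valid" error.
    PageText = WB.RetrieveJavascriptValue("document.body.textContent")

    ' A statement that is simply executed, with nothing returned.
    ' In the video, this worked with the trailing semicolon.
    ' (Assumption: the back button uses ExecuteJavascript.)
    WB.ExecuteJavascript "history.back();"

End Sub
```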

What if instead of the page text you want the HTML of the page, because sometimes there's values hiding in HTML, there's actual hidden text fields and stuff you can have. Like, I hide a customer ID in a hidden field; you can grab that if you know the HTML.

The HTML is behind any document on the web. It's what feeds into your browser. We'll put another little button right here, and we'll call this guy GetHTMLButton.

Right-click, Build Event. I don't want to type it all in, so we'll just copy it and paste it. What's the command for this? Well, it's document.documentElement.outerHTML. That's it right there: document.documentElement.outerHTML.
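A minimal sketch of that second button, assuming it reuses the same PageText box as the first one (the video copies and pastes the earlier code and only swaps the expression):

```vba
Private Sub GetHTMLButton_Click()

    ' Clear the box, then fill it with the full markup of the page,
    ' including the outer <html> tag itself.
    PageText = ""
    PageText = WB.RetrieveJavascriptValue("document.documentElement.outerHTML")

    ' The same call works for other DOM properties. For example, the
    ' page title could be printed to the Immediate window like this:
    ' Debug.Print WB.RetrieveJavascriptValue("document.title")

End Sub
```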

There's also innerHTML, which is just the stuff inside the HTML tags, and you can ask for other pieces the same way, like document.body.innerHTML or document.title. All kinds of different stuff.

Find a JavaScript reference. I'll try to find a couple links and put them down below for you. Here's two of them: in fact, I like W3Schools and I like Mozilla.org. These are the DOM references, the Document Object Model. They'll tell you all the different parts of the page you can get, your HTML tags, getting elements, getting IDs.

I'm going to show you how to get elements and IDs in the next lesson or two, so I'm going to cover a couple more of these things. This will give you everything you can grab. I'll put links to both of these down below.

But now, go to a page, hit Get HTML, and there's the whole HTML of the page. Of course, we'll zoom in - you can see there it is. That's all the stuff that gets sent to the browser.

Now, what use is all this? Let's say you want to go to your bank's website and grab your account balance off of it. Let's go to our links page here, and oh, here's a link to Bank of Richard. Let's go there. Of course, it's my bank and you can see here that my current balance is right there.

Now that I can get the page text, there's the page text. Let's zoom in and see it. I can see that my balance is between "Current Balance:" and then "Thank You" down below here. Or, if you're like me personally, I don't care about pennies when I'm just doing my daily balances and stuff. I just care about what's between "Current Balance:" and that period. So I just grab that.

Obviously, if you're doing real accounting and you need all the cents, then go through to whatever comes next.

But how do we take data in here and find the stuff between two delimiters, like between "Current Balance:" and "Thank You"? You could use my FindBetween function. What's that? I have a whole video on it. Here it is. There's a link. I'll put a link down below for you. If you're a Gold Member, you can just come down here and go to the Code Vault and grab my FindBetween function.

Here it is. Here's the code. I'll pause right now so you can pause the video and start typing, or you can join as a Gold Member, and then you just come in here and click Copy. See, now it's on my clipboard.

Now I come back here and let's close this. Let's go into my global module, and I'll come down here to the bottom, paste it in. Right there is my FindBetween function. You give it a whole string, you give it the start, you give it the end of the string. Optionally, you can have a second end, because sometimes a string will start with one thing and it can end with one or possibly two different things. I built this for my account balances database.
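If you don't have access to the Code Vault, the sketch below shows one way a function with that shape could be written. To be clear, this is not the Code Vault version: the parameter names, the case-insensitive matching, the trimming of spaces, and the choice to return an empty string when a delimiter is missing are all assumptions made for this illustration.

```vba
' Hypothetical stand-in for FindBetween; NOT the Code Vault code.
' Returns the text between StartText and EndText (or EndText2, if it
' is supplied and appears first). Returns "" if nothing is found.
Public Function FindBetween(Source As String, StartText As String, _
    EndText As String, Optional EndText2 As String = "") As String

    Dim StartPos As Long, EndPos As Long, EndPos2 As Long

    ' Locate the start delimiter; give up if it is missing.
    StartPos = InStr(1, Source, StartText, vbTextCompare)
    If StartPos = 0 Then Exit Function

    ' Move to the first character after the start delimiter.
    StartPos = StartPos + Len(StartText)

    ' Locate the end delimiter, searching only after the start.
    EndPos = InStr(StartPos, Source, EndText, vbTextCompare)

    ' If a second end delimiter was given, use whichever comes first.
    If Len(EndText2) > 0 Then
        EndPos2 = InStr(StartPos, Source, EndText2, vbTextCompare)
        If EndPos2 > 0 And (EndPos = 0 Or EndPos2 < EndPos) Then EndPos = EndPos2
    End If

    If EndPos = 0 Then Exit Function

    ' Trim$ removes leading and trailing spaces only, not line breaks.
    FindBetween = Trim$(Mid$(Source, StartPos, EndPos - StartPos))

End Function
```

With this sketch, FindBetween("Current Balance: $1,234.56 Thank You", "Current Balance:", "Thank You") gives "$1,234.56", and passing "." as the optional second end gives "$1,234", which matches the "I don't care about pennies" case.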

Now all I need to know is FindBetween.

Let's go back here. Let's go back to my bank site.

Let's get that page text. Let's take a look at the page text closer. I want to find between "Current Balance:" and "Thank You". So now I'll make another button.

You could do this all as part of the same button or multiple things in the code. I'll just put a little "Find Balance." Let's say Find Balance. We'll call this FindBalanceButton.

Right-click, Build Event.

Dim S as string. We're gonna say S = FindBetween(PageText, "Current Balance: ", "Thank You"). You can put the space in there if you want to; it doesn't matter, in terms of the spaces. What's the end of the string? We'll look for "Thank You," just like that.

Now that's in S, we can MessageBox S and see what that looks like.

Back out here, close it, close it, open it. Let's go to PCResale.net, go to Links, log into my bank. There's my bank page.

Let's get the page text and now Find Balance. Oh, I'm not using the PageText field, that's my fault. I should have noticed that when it didn't capitalize.

That's my bad. I'm going to leave that error in there, so you see it because you're probably going to do it too. Click, and there it is, there's your balance.
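For reference, a working version of this button might look like the sketch below, with the text box name spelled correctly. The Nz call is an addition not shown in the video; it guards against the PageText box being empty (Null), which a String parameter would not accept.

```vba
Private Sub FindBalanceButton_Click()

    Dim S As String

    ' Pull out whatever sits between the two delimiters in the
    ' text that Get Page Text placed in the PageText box.
    S = FindBetween(Nz(PageText, ""), "Current Balance:", "Thank You")

    ' Show the result.
    MsgBox S

End Sub
```

Having Option Explicit at the top of the module also helps with this kind of mistake: a misspelled control name is then flagged as "Variable not defined" when you compile, instead of being silently treated as an empty variable.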

We loaded the page, we got the page text, we scraped it for exactly what we were looking for with my FindBetween function. There it is. Now you can convert that to a currency if you want to, save it in your table, do whatever you want with it. I don't care. It's your information.
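Converting the scraped text to a currency value could be done along the lines of the sketch below. The function name TextToCurrency is made up for this example, and how strings such as "$1,234.56" are interpreted depends on the regional settings of the machine, so test it with your own data.

```vba
' Hypothetical helper: turns scraped text into a Currency value,
' or Null if the text does not look like a number.
Public Function TextToCurrency(ByVal Raw As String) As Variant

    ' Page text often carries line breaks, which Trim$ does not remove.
    Raw = Trim$(Replace(Replace(Raw, vbCr, ""), vbLf, ""))

    If IsNumeric(Raw) Then
        TextToCurrency = CCur(Raw)
    Else
        TextToCurrency = Null
    End If

End Function
```

Returning Null rather than zero for unreadable text makes it harder to mistake a failed scrape for a real zero balance.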

You can use the same method to get any amount of information off of the text of a page. Pretty cool stuff.

That FindBetween function is one of the things I use in my account balances template. This is a database designed so you can record your balances daily or weekly or a couple times a month, whenever you do it. You go to your different bank account websites, you go to your different credit card accounts, and this will just get the balances off those pages.

I don't use the Edge Browser Control on this; I basically have it load from your existing browser. You basically log on to your bank, and then the database finds that window. But it's the same technique as far as finding the text between two things on the page. That's what I use for here. I'm thinking about integrating the Edge Browser Control into this database too, but I haven't done it yet because I'm still waiting for some bugs to get ironed out. But it's coming along, so pretty soon.

That's going to do it for part eight. Part nine coming up - I'm going to show you how to get values from specific form fields on a web page. We're going to show you how to grab elements out of specific IDs and how to put data into a page if you want to submit a form online.

That's all coming up soon, but that's going to be your TechHelp video for today. I hope you learned something.

Live long and prosper, my friends. I'll see you next time.

Quiz

Q1. What is the main purpose of the Edge Browser Control demonstrated in this tutorial?
A. To automate web-based accounting calculations
B. To navigate between web pages in Access automatically
C. To allow Access users to scrape and extract information from web pages
D. To secure online connections in Access databases

Q2. What is "web scraping" as described in the video?
A. Creating new web pages inside Access
B. Collecting specific information from a web page's content
C. Designing websites using Access forms
D. Automating the logging in to secure sites

Q3. What JavaScript code is used to get the visible text content from a web page in the video?
A. document.body.HTMLContent
B. document.header.textContent
C. document.body.textContent
D. document.innerText

Q4. Why is correct capitalization important when using JavaScript in the context of scraping data as shown in the video?
A. JavaScript is not case sensitive
B. Access will automatically correct capitalization errors
C. JavaScript is case sensitive and requires exact capitalization
D. It only matters for HTML tags, not JavaScript

Q5. What happens if the capitalization in the JavaScript string is incorrect?
A. The database will crash
B. The button will not respond
C. The return value will be "undefined"
D. Only part of the text will be retrieved

Q6. Why might adding a semicolon to a JavaScript statement cause problems when using RetrieveJavaScriptValue in this context?
A. All JavaScript statements require semicolons for proper execution
B. The function expects only property expressions, not statements with semicolons
C. Semicolons are required for all web page scripts
D. Semicolons make the code execute faster

Q7. Which JavaScript code retrieves the complete HTML source of the current web page?
A. document.body.innerHTML
B. document.documentElement.outerHTML
C. document.HTMLContent
D. document.all.HTML

Q8. What purpose can scraping the HTML of a web page serve, according to the video?
A. To change the formatting of the page in Access
B. To find values hidden in input fields or retrieve additional metadata
C. To customize browser settings
D. To download images from the page

Q9. What is the main advantage of using functions like FindBetween with scraped text data?
A. It allows you to convert all text to upper case
B. It helps extract targeted data found between specific delimiters
C. It formats dates and currency automatically
D. It scans pages for broken links

Q10. According to the video, which of the following is NOT a recommended JavaScript reference site for learning about the Document Object Model?
A. MDN (Mozilla.org)
B. W3Schools
C. Stack Overflow
D. Both MDN and W3Schools are recommended

Q11. What will the FindBetween function return if the delimiters are not found in the provided string?
A. The entire string
B. An error message
C. An empty string
D. The last word in the string

Q12. What is planned for the next part (part nine) of the Edge Browser Control series mentioned at the end of the video?
A. Exporting scraped data to Excel
B. Getting values from specific form fields and submitting forms online
C. Building custom browsers for Access
D. Creating a login form for Access

Answers: 1-C; 2-B; 3-C; 4-C; 5-C; 6-B; 7-B; 8-B; 9-B; 10-C; 11-C; 12-B

DISCLAIMER: Quiz questions are AI generated. If you find any that are wrong, don't make sense, or aren't related to the video topic at hand, then please post a comment and let me know. Thanks.

Summary

Today's TechHelp tutorial from Access Learning Zone continues our series on the Edge Browser Control in Microsoft Access. In part eight, I'm going to explain how to extract the text content, HTML, and other information from whatever web page appears inside your browser control. If you need to retrieve details like the page's title, text, or HTML code, you will see how all of that can be captured with the right approach.

Before moving forward, make sure you have already watched parts one through seven in this series, as this lesson builds on the earlier material.

Let's pick up where we left off in our browser form. To start, I add a new button to the form to allow us to grab the current page's text content with a single click. You might want to do all kinds of things with this text. For example, if you go to your bank's website, you could programmatically look up your account balance, your pending transactions, or any other details visible on the page. This process is broadly called "scraping" a web page, where you programmatically extract whatever information you need.

To prepare the form, I adjust the size of the status box since we don't need a large area for status updates. I make room for a new text box where the captured page text will appear. I label this text box as "Page Text" for clarity, though you could certainly extract other types of page information the same way.

Next, I add a new button and label it "Get Page Text." With this button, I initiate the routine to extract the desired content. When setting up the event procedure for this button, I first clear out any previous text inside the "Page Text" box to avoid confusion from leftover data.

The mechanism to capture web page content relies on JavaScript expressions. The function I use allows retrieval of pretty much anything accessible in the document object model – you can pull out the body, title, HTML source, and more. If you are comfortable with JavaScript, you can grab any object or property displayed on the page. For this example, I'm targeting the text content found in the page's <body> section using the JavaScript expression 'document.body.textContent'.

It's important to note that JavaScript is case sensitive, unlike VBA. For this to work, 'textContent' must be written exactly as required, with a lowercase 't' and uppercase 'C'. Getting case wrong will return 'undefined', which can be a frustrating debugging mystery. Trust me, capitalizing things properly is essential when working with JavaScript.

Another little quirk worth mentioning: sometimes, JavaScript statements require a semicolon at the end, but with property retrievals like 'document.body.textContent', a semicolon may actually cause errors. It is inconsistent and depends on the nature of the JavaScript you're executing. The best advice is to test your syntax and adjust as needed.

Now let's say you need more than just the plain page text – sometimes you also want the HTML source code. This can be useful for pulling out hidden values, custom IDs, or any information not visible in the rendered text. I add another button called "Get HTML" for this purpose. The JavaScript expression for the entire HTML markup is 'document.documentElement.outerHTML'. There are variants as well, such as 'innerHTML', depending on whether you want just the content inside the HTML tags or the entire HTML source.

If you want to explore all the different properties and elements you can access, I recommend researching the Document Object Model, or DOM. There are excellent references online, and sites like W3Schools and Mozilla.org provide extensive documentation and examples.

To make this example practical, suppose you want to pull your account balance from your bank's website. After navigating to the relevant page, you can capture the page's text. You can then look for the part of the text containing your balance, for example, between "Current Balance:" and the next occurrence of "Thank You" or a period. If you do not care about cents, you can customize the parsing as needed.

To automate this kind of search, you would use a string handling function. I recommend my 'FindBetween' function, which extracts text located between two specified delimiters. If you are a Gold Member, you can retrieve this function from the Code Vault on my website. Simply add the function to your global module so it's ready for reuse whenever needed.

Once you have captured the page text, you can use the button "Find Balance" to extract the value you are looking for. The function takes the body text, and the start and end delimiters, and returns the contained text. You could take this value, format or convert it as necessary, and store it in your database.

This methodology works for more than just balances. You can use it to grab any specific fragment of data you need from the body or HTML of a web page and incorporate it in your Access solutions. I use this method myself in my own account balances database, enabling me to quickly record balances from a number of websites. While that solution currently reads the page from a window in your existing browser, after you have logged in there, rather than using the Edge Browser Control, the concept of parsing out strings from the page is exactly the same.

That wraps up part eight of the Edge Browser Control series. In the next installment, I'll be showing you how to retrieve values directly from specific form fields and elements on a web page, as well as how to automate populating those elements if you want to submit a form.

You can find a complete video tutorial with step-by-step instructions on everything discussed here on my website at the link below.

Live long and prosper, my friends.

Topic List

Adding a Get Page Text button to the browser form
Using RetrieveJavaScriptValue to access page content
Extracting text content with document.body.textContent
Handling case sensitivity in JavaScript for VBA
Troubleshooting undefined values due to capitalization errors
Dealing with JavaScript semicolons in VBA calls
Getting the full HTML content with document.documentElement.outerHTML
Adding a Get Page HTML button to retrieve HTML source
Using a custom FindBetween function to extract text
Extracting values between delimiters in page text
Integrating FindBetween to parse data from scraped content
Displaying extracted information using VBA MessageBox