Web scraping with PyScript

This is my first day looking at PyScript and I have a question that will determine if I investigate further. At first glance, it looks very promising.

I have a Beautiful Soup application. Can I run it (with suitable modification) in PyScript?

I know I can’t use Selenium since it depends on WebDriver. I assume I can trigger click events on button and option DOM elements.

Depending on the answer to this I may have more specific questions later.

Thanks, Tom

You surely can use Beautiful Soup in PyScript, probably without much if any modification of the bs4 code. For example, here’s a (slapdash) snippet of code that you can run in a py-script tag that prints out the tree of tags from the current page. (You’ll need to add beautifulsoup4 to the page’s <py-env> tag):

from bs4 import BeautifulSoup

from js import document
from pyodide.http import open_url

# Use the following to get another page's content synchronously;
# otherwise we will use the current page:
# page_html = open_url('hello_world.html')

page_html = document.documentElement.innerHTML
soup = BeautifulSoup(page_html, 'html.parser')

def print_self_and_children(tag, indent=0):
    print("_" * indent + str(tag.name))
    if hasattr(tag, 'children'):
        for child in tag.children:
            if hasattr(child, 'name') and child.name is not None:
                print_self_and_children(child, indent=indent + 2)

print_self_and_children(soup)
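
For reference, a minimal page wrapping that snippet might look like the sketch below. This assumes the early (2022-era) PyScript release that declares packages in a <py-env> tag; the CDN paths shown are the ones those docs used, and newer releases replaced <py-env> with <py-config>, so check the documentation for your release.

<!DOCTYPE html>
<html>
  <head>
    <link rel="stylesheet" href="https://pyscript.net/latest/pyscript.css" />
    <script defer src="https://pyscript.net/latest/pyscript.js"></script>
  </head>
  <body>
    <!-- Packages Pyodide should fetch before the code runs -->
    <py-env>
      - beautifulsoup4
    </py-env>
    <py-script>
      # ... the BeautifulSoup snippet above goes here ...
    </py-script>
  </body>
</html>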

As you say, Selenium won’t work inside a browser environment, but you can use DOM selectors and various interaction methods (click(), option.selected, etc.) to test interaction if need be.
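
For instance, here is a rough sketch of driving the page from inside a py-script tag; the element ids are made up for illustration:

from js import document, Event

# Simulate a user clicking a button (the id is hypothetical)
button = document.querySelector("#submit-button")
if button is not None:
    button.click()

# Choose an <option> in a <select> and notify any listeners (id also hypothetical)
select = document.querySelector("#colour-select")
if select is not None:
    select.selectedIndex = 1
    select.dispatchEvent(Event.new("change"))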

I am trying to scrape via PyScript as well, primarily because I want to build an application whose scraper reuses the browser’s existing logins/authentications, so there are no manual steps where the user has to enter their credentials into a page loaded via Selenium.

However, this does seem to be the only option, as I would need to automate the download of a chromedriver via something like this:

Automatic download of appropriate chromedriver for Selenium in Python - Stack Overflow

Once this happens, Selenium can do its thing.

I believe this will require the following packages:

  • requests
  • wget
  • zipfile
  • os
  • selenium

Is this possible with PyScript?

Thank you for the help!

Sadly, neither requests nor selenium will work within a browser window - that is to say, running with PyScript. Requests relies significantly on the ssl package, and sockets are not available within a browser window. And Selenium relies on being able to instantiate an instance of the browser itself (headlessly or in a window), which a browser environment won’t permit you to do.
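
If the goal is just to fetch a page’s HTML and hand it to Beautiful Soup, the in-browser route is pyodide.http: open_url for a synchronous request (as in the snippet earlier in the thread) or pyfetch for an asynchronous one. A rough sketch, with a placeholder URL, and assuming the target server sends the CORS headers the browser requires:

import asyncio

from bs4 import BeautifulSoup
from pyodide.http import pyfetch

async def fetch_and_parse(url):
    # pyfetch wraps the browser's fetch(), so it is subject to the same CORS rules
    response = await pyfetch(url)
    html = await response.string()  # newer Pyodide releases also offer .text()
    return BeautifulSoup(html, 'html.parser')

async def main():
    soup = await fetch_and_parse('https://example.com/')  # placeholder URL
    print(soup.title)

asyncio.ensure_future(main())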

Thanks for the suggestions. I have determined that I can do what I want with PyScript.

I’m now asking for guidance on using PyScript in browser extensions, accessing the DOM of the current page.

Since this topic is not restricted to web scraping, perhaps I should start a new thread?

Hello Jeff, I am trying to use an HTTP request to get the HTML before using Beautiful Soup, but I keep running into this error. I also followed your code but I am still lost. Please help.

Traceback (most recent call last):
  File "/lib/python3.10/_pyodide/_base.py", line 460, in eval_code
    .run(globals, locals)
  File "/lib/python3.10/_pyodide/_base.py", line 306, in run
    coroutine = eval(self.code, globals, locals)
  File "<exec>", line 9, in <module>
  File "/lib/python3.10/site-packages/requests/api.py", line 73, in get
    return request("get", url, params=params, **kwargs)
  File "/lib/python3.10/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/lib/python3.10/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/lib/python3.10/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/lib/python3.10/site-packages/pyodide_http/_requests.py", line 42, in send
    resp = send(pyodide_request, stream)
  File "/lib/python3.10/site-packages/pyodide_http/_core.py", line 113, in send
    xhr.send(to_js(request.body))
pyodide.JsException: NetworkError: Failed to execute 'send' on 'XMLHttpRequest': Failed to load 'https://www