How to scrape any product page with Python and ChatGPT?
Web scraping is powerful, we won't tell you otherwise. It lets you collect publicly accessible data quickly, with (almost) no errors, and at an ultra-competitive price.
However, the problem is simple: for each site, you have to develop a dedicated robot, called a crawler. A crawler for Amazon, a crawler for Etsy, a crawler for eBay... and it gets very expensive.
Based on the prices charged by our company (without a doubt the most competitive on the market!), you should count on 500-1000 EUR per robot per site. With 5-10 robots, costs quickly become prohibitive.
But couldn't we simply hand the HTML code of a page to a third-party artificial intelligence, and get the critical information back?
In this tutorial, we'll see how to scrape any product page with ChatGPT and Python.
Developers, product managers, price watchers: this tutorial is for you!
Here is the complete code, also directly accessible here:
```python
import os
import requests
import html2text
import re
import argparse

OPENAI_API_KEY = 'YOUR_OPEN_AI_API_KEY'
COMPLETION_URL = 'https://api.openai.com/v1/chat/completions'
PROMPT = """Find the main article from this product page, and return from this text content, as JSON format:
article_title
article_url
article_price
%s"""
MAX_GPT_WORDS = 2000


class pricingPagesGPTScraper:

    def __init__(self):
        self.s = requests.Session()

    def get_html(self, url):
        assert url and isinstance(url, str)
        print('[get_html]\n%s' % url)
        headers = {
            'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
            'accept-language': 'fr-FR,fr;q=0.9,en-US;q=0.8,en;q=0.7',
            'cache-control': 'max-age=0',
            'sec-ch-device-memory': '8',
            'sec-ch-dpr': '2',
            'sec-ch-ua': '"Chromium";v="112", "Google Chrome";v="112", "Not:A-Brand";v="99"',
            'sec-ch-ua-mobile': '?0',
            'sec-ch-ua-platform': '"macOS"',
            'sec-ch-ua-platform-version': '"12.5.0"',
            'sec-ch-viewport-width': '1469',
            'sec-fetch-dest': 'document',
            'sec-fetch-mode': 'navigate',
            'sec-fetch-site': 'none',
            'sec-fetch-user': '?1',
            'upgrade-insecure-requests': '1',
            'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36',
            'viewport-width': '1469',
        }
        self.s.headers = headers
        r = self.s.get(url)
        assert r.status_code == 200
        html = r.text
        return html

    def convert_html_to_text(self, html):
        assert html
        h = html2text.HTML2Text()
        h.ignore_links = True
        h.ignore_images = True
        text = h.handle(html)
        assert text
        return text

    def reduce_text_size(self, text):
        print('Starting text size: %s' % len(text))
        assert text
        words = re.findall(r'\w+', text)
        if len(words) > MAX_GPT_WORDS:
            initial_characters = len(text)
            size_ratio = len(words)/MAX_GPT_WORDS
            print('/!\\ text too large! size being divided by %s' % size_ratio)
            max_characters = int(initial_characters//size_ratio)
            text = text[:max_characters]
        print('Ending text size: %s' % len(text))
        return text

    def fill_prompt(self, text):
        assert text
        prompt = PROMPT % text
        return prompt

    # @retry(AssertionError, tries=3, delay=2)
    def get_gpt(self, prompt):
        headers = {
            'Authorization': 'Bearer %s' % OPENAI_API_KEY,
        }
        json_data = {
            'model': 'gpt-3.5-turbo',
            'messages': [
                {
                    "role": "user",
                    "content": prompt
                }
            ],
            'temperature': 0.7
        }
        response = requests.post(COMPLETION_URL, headers=headers, json=json_data)
        assert response.status_code == 200
        content = response.json()["choices"][0]["message"]["content"]
        return content

    def main(self, url):
        assert url
        html = self.get_html(url)
        text = self.convert_html_to_text(html)
        text = self.reduce_text_size(text)
        prompt = self.fill_prompt(text)
        answer = self.get_gpt(prompt)
        return answer


def main():
    argparser = argparse.ArgumentParser()
    argparser.add_argument('--url', '-u', type=str, required=False,
                           help='product page url to be scraped',
                           default='https://www.amazon.com/dp/B09723XSVM')
    args = argparser.parse_args()
    url = args.url
    assert url
    pp = pricingPagesGPTScraper()
    answer = pp.main(url)
    print(answer)
    print('''~~ success
 _       _         _         
| |     | |       | |        
| | ___ | |__  ___| |_ __ __ 
| |/ _ \| '_ \/ __| __/| '__|
| | (_) | |_) \__ \ |_ | |   
|_|\___/|_.__/|___/\__||_|   
''')


if __name__ == '__main__':
    main()
```
Using this script is very simple: download the .py file, replace the value of the OpenAI API key with your own, and run the script as follows, specifying the URL to scrape:
```bash
$ python3 chatgpt_powered_product_page_universal_scraper.py --url https://www.walmart.com/ip/1146797
[get_html]
https://www.walmart.com/ip/1146797
Starting text size: 1915
Ending text size: 1915
{
    "article_title": "Weber 14\" Smokey Joe Charcoal Grill, Black",
    "article_url": "",
    "article_price": "USD$45.99"
}
~~ success
 _       _         _         
| |     | |       | |        
| | ___ | |__  ___| |_ __ __ 
| |/ _ \| '_ \/ __| __/| '__|
| | (_) | |_) \__ \ |_ | |   
|_|\___/|_.__/|___/\__||_|   
```
And that's it!
Having an idea in your head is good. But how do you implement it?
In this tutorial, we will see how to code a complete tool that fetches the content of an HTML page, provides it to ChatGPT, and lets it return the price, the title, and the URL of an item in JSON format.
To put it simply, this is how our program will work:
This tutorial will be divided into 6 steps:

1. Get an OpenAI API key
2. Retrieve the HTML code of the page
3. Convert the HTML to text
4. Reduce the text size
5. Write the prompt and query ChatGPT
6. Test the scraper on any URL
Let's go!
We all know ChatGPT, the interface that lets us talk directly with GPT-3.5, the artificial intelligence developed by OpenAI. But how do you use this artificial intelligence directly from a Python script?
Well, it's simple: you go through their API, and therefore need the API key that will let our program interact with the service developed by OpenAI.
To do this, proceed as follows:
As seen here:
And that's it!
Finally, we will write the value of the key in our script:
```python
OPENAI_API_KEY = 'sk-dTRYAg…'
```
With our Python script, we will now be able to interact with the OpenAI API programmatically, and thus with the exceptional robotic intelligence ChatGPT.
By creating your account, associated with your phone number, you get $5 of free credit, which is about 750 API calls. That's not enough for a structural need, but it will be more than enough for the purpose of this tutorial.
Let's play.
As seen in the introduction, the program works as follows: it first retrieves the HTML code of a page, cleans it up, and then sends it to OpenAI's artificial intelligence ChatGPT, so that it identifies the relevant elements, namely the price, the title, and the URL of the product.
In other words, the Python script takes care of navigating the pages and retrieving the raw content, what we call browsing, while ChatGPT takes care of the parsing, the extraction of information from a page.
We are now going to retrieve the HTML content of the page with our script.
We install the requests library, the most downloaded third-party Python library in the world, which lets a script browse the Internet.
As follows:
```bash
$ pip3 install requests
```
We then create our class, pricingPagesGPTScraper, with a first method, get_html, which takes a URL as input and returns the HTML content of the page as output.
As follows:
```python
import requests


class pricingPagesGPTScraper:

    def __init__(self):
        self.s = requests.Session()

    def get_html(self, url):
        assert url and isinstance(url, str)
        print('[get_html]\n%s' % url)
        headers = {
            'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
            'accept-language': 'fr-FR,fr;q=0.9,en-US;q=0.8,en;q=0.7',
            'cache-control': 'max-age=0',
            'sec-ch-device-memory': '8',
            'sec-ch-dpr': '2',
            'sec-ch-ua': '"Chromium";v="112", "Google Chrome";v="112", "Not:A-Brand";v="99"',
            'sec-ch-ua-mobile': '?0',
            'sec-ch-ua-platform': '"macOS"',
            'sec-ch-ua-platform-version': '"12.5.0"',
            'sec-ch-viewport-width': '1469',
            'sec-fetch-dest': 'document',
            'sec-fetch-mode': 'navigate',
            'sec-fetch-site': 'none',
            'sec-fetch-user': '?1',
            'upgrade-insecure-requests': '1',
            'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36',
            'viewport-width': '1469',
        }
        self.s.headers = headers
        r = self.s.get(url)
        assert r.status_code == 200
        html = r.text
        return html
```
The values used in headers are the usual values sent by a Chrome browser. If you want to use the ones from your own browser, open the Network tab of your browser's inspection tool, right-click on the request, select Copy as cURL, and convert the cURL command to Python format here.
Before developing a script, we will first test our idea directly from the graphical interface proposed by Open AI. Let's take a first example product page, the following one:
https://www.amazon.com/dp/B0BZVZ6Z8C
After all, it's summer: BBQ and sausages in the spotlight.
Go to the page, right-click, and select View page source. A new tab opens with the HTML code of the page, the same code our Python browser will retrieve.
Select all, copy, then open ChatGPT and paste the content.
And there, first cold shower: we receive this rather explicit error message:
The message you submitted was too long, please reload the conversation and submit something shorter.
In other words, the message is too long. How can we reduce the size of the text without affecting the quality of the content?
Well, it's simple, we'll convert the HTML content into text content!
After a quick search on Google, we find this Stack Overflow thread, which recommends the html2text library:
So we install the library in question:
```bash
$ pip3 install html2text
```
Then we import the library into our script, and create a new method, convert_html_to_text, to convert the HTML code into text, as follows:
```python
import html2text
...

class pricingPagesGPTScraper:
    ...
    def convert_html_to_text(self, html):
        assert html
        h = html2text.HTML2Text()
        h.ignore_links = True
        h.ignore_images = True
        text = h.handle(html)
        print('HTML size: %s' % len(html))
        print('Text size: %s' % len(text))
        return text
```
The two attributes ignore_links and ignore_images tell the converter to drop the hyperlinks and images present on the page instead of rendering them as text. Since ChatGPT only works with text content, this is perfect for us!
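To make the effect of these two flags concrete, here is a minimal standalone sketch (the HTML snippet is invented for the illustration):

```python
import html2text

# A made-up product snippet, with a link and an image
sample_html = '<p>Weber 14" Smokey Joe: <a href="/ip/1146797">see product</a> <img src="grill.jpg" alt="grill"></p>'

h = html2text.HTML2Text()
h.ignore_links = True   # keep the anchor text, drop the href
h.ignore_images = True  # drop the image entirely
print(h.handle(sample_html))
# Weber 14" Smokey Joe: see product
```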
We launch the program, and... eureka! The size of the text has been divided by almost 100.
```bash
$ python3 chatgpt_powered_product_page_universal_scraper.py
HTML size: 2084770
Text size: 22287
```
Above all, the output is of excellent quality: nothing but text that is readable and understandable by a human (or an artificial intelligence), with no more markup meant for third-party computer programs.
We take the text content, paste it into ChatGPT again, and... suspense... no more unreadable markup, but the verdict remains the same: the text is still too long.
In the next part, we will see how to reduce the text size.
As you can see, the text we provide to ChatGPT is too long. Terribly long, at more than 20,000 characters.
But what is the maximum size that ChatGPT can accommodate?
After a quick Google search, it appears that the model can accept a maximum of 4,096 tokens per request, prompt and answer combined.
What is the difference between a token and a word? According to OpenAI, think of a token as a chunk of characters, with 1 token equal to roughly 4 characters, or 0.75 words.
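This rule of thumb is enough to estimate, before sending anything, whether a text will fit; a back-of-the-envelope sketch, not the official tokenizer:

```python
import re

def estimate_tokens(text):
    # OpenAI's rule of thumb: 1 token ~ 4 characters ~ 0.75 words
    estimate_from_chars = len(text) / 4
    estimate_from_words = len(re.findall(r'\w+', text)) / 0.75
    return estimate_from_chars, estimate_from_words

print(estimate_tokens('Weber 14" Smokey Joe Charcoal Grill, Black'))
# The two estimates rarely agree exactly, hence our conservative word limit below
```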
Given the uncertainty about the ratio between tokens and words, we will choose a conservative limit of 2,000 words, and here is the part of the code that limits the size of the text content:
```python
MAX_GPT_WORDS = 2000
...

class pricingPagesGPTScraper:
    ...
    def reduce_text_size(self, text):
        assert text
        words = re.findall(r'\w+', text)
        if len(words) > MAX_GPT_WORDS:
            initial_characters = len(text)
            size_ratio = len(words)/MAX_GPT_WORDS
            print('/!\\ text too large! size being divided by %s' % size_ratio)
            max_characters = int(initial_characters//size_ratio)
            text = text[:max_characters]
        return text
```
The code works as follows: it counts the words in the text with a regex; if the count exceeds MAX_GPT_WORDS, it computes the ratio between the word count and the limit, then truncates the text to the proportionally reduced number of characters.
And that's it!
The text is now reduced, cut down by exactly the ratio needed to fit our 2,000-word budget; in the run below, it loses about 44% of its size.
Short and sweet.
```bash
$ python3 chatgpt_powered_product_page_universal_scraper.py
Starting text size: 22328
/!\ text too large! size being divided by 1.788
Ending text size: 12487
```
To avoid a potential loss of information, one could instead split the text into several parts, summarize each part, and provide the combined summary to ChatGPT for parsing. This article is interesting on the subject. To be explored in future experiments.
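For the record, here is what such a split could look like; a naive sketch, where summarize() stands for an extra ChatGPT call per chunk (a hypothetical helper, not part of our script):

```python
def split_text(text, max_words=2000):
    # Naive split into consecutive chunks of at most max_words words
    words = text.split()
    for i in range(0, len(words), max_words):
        yield ' '.join(words[i:i + max_words])

# summaries = [summarize(chunk) for chunk in split_text(text)]  # one GPT call per chunk
# answer = get_gpt(fill_prompt('\n'.join(summaries)))           # final parsing call
```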
Our text is of good size, we'll now submit it to ChatGPT, so that it can fetch the right information.
Let's get to the prompt!
It's time to interact with ChatGPT!
In this part, we will write a first prompt, which will let us ask ChatGPT to fetch, from the text provided, the main information of the product page: the title, the URL, and the price of the article.
For this first test, we will keep the example chosen initially: https://www.amazon.com/dp/B0BZVZ6Z8C
And in order to facilitate the programmatic processing of the data, here is the first prompt we will use:
```
Find the main article from this product page, and return from this text content, as JSON format:
article_title
article_url
article_price
```
Finally, we'll send this to OpenAI, using the chat completions endpoint:
```python
...
COMPLETION_URL = 'https://api.openai.com/v1/chat/completions'
AMAZON_URL = 'https://www.amazon.com/dp/B0BZVZ6Z8C'
PROMPT = """Find the main article from this product page, and return from this text content, as JSON format:
article_title
article_url
article_price
%s"""


class pricingPagesGPTScraper:
    ...
    def fill_prompt(self, text):
        assert text
        prompt = PROMPT % text
        return prompt

    def get_gpt(self, prompt):
        headers = {
            'Authorization': 'Bearer %s' % OPENAI_API_KEY,
        }
        json_data = {
            'model': 'gpt-3.5-turbo',
            'messages': [
                {
                    "role": "user",
                    "content": prompt
                }
            ],
            'temperature': 0.7
        }
        response = requests.post(COMPLETION_URL, headers=headers, json=json_data)
        assert response.status_code == 200
        content = response.json()["choices"][0]["message"]["content"]
        return content
```
GPT-4 was released on March 14, 2023, but it is currently not available via the API unless you are on OpenAI's preferred list. So we chose gpt-3.5-turbo, and as you'll see, it works very well.
We run the program, and this time the result is amazing. Here is what we get:
```bash
$ python3 chatgpt_powered_product_page_universal_scraper.py
[get_html]
https://www.amazon.com/dp/B0BZVZ6Z8C
Starting text size: 13730
/!\ text too large! size being divided by 1.056
Ending text size: 13001
{
    "article_title": "Barbecue Grill, American-Style Braised Oven Courtyard Outdoor Charcoal Grilled Steak Camping Household Charcoal Grill Suitable for Family Gatherings",
    "article_url": "https://www.amazon.com/dp/B0BZVZ6Z8C",
    "article_price": "$108.89"
}
```
A perfectly structured JSON, with the 3 expected elements.
And when we go to the page, we find these elements there:
Everything checks out. Beautiful!
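Note that ChatGPT returns this JSON as a plain string. To exploit it downstream, you can parse it, for instance like this; a small sketch, assuming the answer is valid JSON, which ChatGPT does not formally guarantee:

```python
import json

answer = pp.main(url)  # the raw string returned by ChatGPT
try:
    data = json.loads(answer)
    print(data['article_title'], data['article_price'])
except json.JSONDecodeError:
    print('ChatGPT did not return valid JSON, retry or adjust the prompt')
```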
The Amazon article is good. But maybe it's a fluke.
How can we test any URL?
We'll use the argparse library, so we can specify, directly from the command line, the URL of the product page from which we want to retrieve the price information.
We will first consolidate our methods into a single entry point:
```python
...
class pricingPagesGPTScraper:
    ...
    def main(self, url):
        assert url
        html = self.get_html(url)
        text = self.convert_html_to_text(html)
        text = self.reduce_text_size(text)
        prompt = self.fill_prompt(text)
        answer = self.get_gpt(prompt)
        return answer
```
Then, in the main function, we retrieve the URL provided on the command line:
```python
import argparse
...

def main():
    argparser = argparse.ArgumentParser()
    argparser.add_argument('--url', '-u', type=str, required=False,
                           help='product page url to be scraped',
                           default='https://www.amazon.com/dp/B09723XSVM')
    args = argparser.parse_args()
    url = args.url
    assert url
    pp = pricingPagesGPTScraper()
    answer = pp.main(url)
    print(answer)
    print('''~~ success
 _       _         _         
| |     | |       | |        
| | ___ | |__  ___| |_ __ __ 
| |/ _ \| '_ \/ __| __/| '__|
| | (_) | |_) \__ \ |_ | |   
|_|\___/|_.__/|___/\__||_|   
''')


if __name__ == '__main__':
    main()
```
Did our program work once by chance, or is it really robust? Let's find out right away.
A first test here: https://www.ebay.com/itm/165656992670
```bash
$ python3 chatgpt_powered_product_page_universal_scraper.py --url https://www.ebay.com/itm/165656992670
[get_html]
https://www.ebay.com/itm/165656992670
Starting text size: 11053
Ending text size: 11053
{
    "article_title": "Portable mangal brazier 10 skewers steel 2.5 mm grill BBQ case for free",
    "article_url": "",
    "article_price": "US $209.98"
}
~~ success
 _       _         _         
| |     | |       | |        
| | ___ | |__  ___| |_ __ __ 
| |/ _ \| '_ \/ __| __/| '__|
| | (_) | |_) \__ \ |_ | |   
|_|\___/|_.__/|___/\__||_|   
```
We are missing the URL here, but for the rest it worked perfectly:
And one last try, with this nice barbecue from Walmart: https://www.walmart.com/ip/Weber-14-Smokey-Joe-Charcoal-Grill-Black/1146797
```bash
$ python3 chatgpt_powered_product_page_universal_scraper.py --url https://www.walmart.com/ip/1146797
[get_html]
https://www.walmart.com/ip/1146797
Starting text size: 1948
Ending text size: 1948
{
    "article_title": "Weber 14\" Smokey Joe Charcoal Grill, Black",
    "article_url": null,
    "article_price": "USD$45.99"
}
~~ success
 _       _         _         
| |     | |       | |        
| | ___ | |__  ___| |_ __ __ 
| |/ _ \| '_ \/ __| __/| '__|
| | (_) | |_) \__ \ |_ | |   
|_|\___/|_.__/|___/\__||_|   
```
And the same thing: a success, with the correct price and title, the URL again being unavailable.
Out of 3 tries, we have a solid return, with:
✅ 3 titles
✅ 3 prices
🔴 1 URL
Not perfect, but more than satisfactory. It works!
As seen in the previous section, automatic scraping of any product page works superbly with ChatGPT and Python!
What are the benefits of this solution?
First of all, as presented in the introduction, this solution brings important flexibility. There is no need to develop a dedicated robot for each type of product page: you just provide the HTML code of a page and reap the benefits.
Consequently, this allows a massive cost reduction, especially when you have many different types of pages. For example, with 100 product pages of different structures, the solution is particularly competitive in terms of cost.
If the advantages are certain, what are the disadvantages?
First of all, and as we saw in the tutorial, the result is imprecise. Out of 3 URLs requested, we obtained only 1 URL, the other two unfortunately missing.
More seriously, if we do a test with this URL, here is the result we get:
```bash
$ python3 chatgpt_powered_product_page_universal_scraper.py --url https://www.amazon.com/dp/B0014C2NBC
{
    "article_title": "Crocs Unisex-Adult Classic Clog",
    "article_url": "https://www.amazon.com/dp/B07DMMZPW9",
    "article_price": "$30.00$30.00 - $59.99$59.99"
}
```
The price format is strangely duplicated, which makes it difficult to exploit. Worse, the returned article_url does not even match the URL we requested.
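If needed, such a duplicated price string can be cleaned up after the fact, for instance by extracting the first monetary amount with a regex; a quick sketch, to adapt to your own currencies and formats:

```python
import re

raw_price = "$30.00$30.00 - $59.99$59.99"
match = re.search(r'\$\s*(\d+(?:\.\d{1,2})?)', raw_price)
if match:
    print(float(match.group(1)))  # 30.0, the first price found
```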
And the result is even more inaccurate when you add additional attributes: ranking, number of reviews, score, categories, associated products, delivery date etc.
Moreover, if you repeat collections across several product pages, it quickly becomes apparent that the results are unstable.
For example, if we modify the prompt to also request the URL of the product image, two successive attempts on the same page give the following results:
```bash
$ python3 chatgpt_powered_product_page_universal_scraper.py --url https://www.amazon.com/dp/B0BQZ9K23L
{
    "article_title": "Stanley Quencher H2.0 FlowState Stainless Steel Vacuum Insulated Tumbler with Lid and Straw for Water, Iced Tea or Coffee, Smoothie and More",
    "article_image_url": null
}
$ python3 chatgpt_powered_product_page_universal_scraper.py --url https://www.amazon.com/dp/B0BQZ9K23L
{
    "article_title": "Stanley Quencher H2.0 FlowState Stainless Steel Vacuum Insulated Tumbler with Lid and Straw for Water, Iced Tea or Coffee, Smoothie and More",
    "article_image_url": "https://m.media-amazon.com/images/I/61k8q3y1V7L._AC_SL1500_.jpg"
}
```
One shot yes, one shot no.
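One lever worth trying against this instability, which we have not benchmarked here: lowering the temperature parameter in the request payload, which controls the randomness of the model's answers; 0 should make outputs more deterministic:

```python
json_data = {
    'model': 'gpt-3.5-turbo',
    'messages': [{"role": "user", "content": prompt}],
    'temperature': 0  # 0 = most deterministic answers; our script uses 0.7
}
```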
Then, as seen in the tutorial, the size of the input text is limited. In most cases this is fine, but it can happen that the text of a product page gets cut in half, losing 50% of its initial size:
```
Starting text size: 26556
/!\ text too large! size being divided by 2.088
Ending text size: 12718
```
In these situations, one can imagine a negative impact on the quality of the result.
Furthermore, and as mentioned in the FAQ, the request speed is slow. It takes an average of 7 seconds per collection, compared to 1-2 seconds for a request on a consolidated API like Keepa.
Finally, the price is high. As seen in the FAQ section, it costs about $4 per 1,000 products. This is a competitive price when you consider that you can scrape any product page.
It is however a high price compared to a dedicated API like Rainforest on Amazon, which charges $1 per 1,000 requests, a price that includes the request to the site itself and the retrieval of a few hundred attributes from the product page.
This solution is nice, but how much does it cost?
According to OpenAI's pricing page, it currently costs $0.002 per 1K tokens. With up to 2,000 words of text content per page, or roughly 2,000 tokens, that's about $0.004 per product page, or $4 for 1,000 pages.
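The back-of-the-envelope calculation, in code form, using the rate above and the rough word-to-token equivalence:

```python
PRICE_PER_1K_TOKENS = 0.002  # USD, gpt-3.5-turbo pricing at the time of writing
TOKENS_PER_PAGE = 2000       # ~2,000 words of text content per product page

cost_per_page = TOKENS_PER_PAGE / 1000 * PRICE_PER_1K_TOKENS
print('%.3f USD per page' % cost_per_page)                   # 0.004 USD
print('%.0f USD per 1,000 pages' % (cost_per_page * 1000))   # 4 USD
```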
Yes, it is entirely legal!
As stipulated in Article L342-3 of the French Intellectual Property Code, when a database is made available to the public, the public may exploit it, as long as the exploitation is not exhaustive.
You will find a detailed article here.
We did the experiment with 3 barbecue URLs. The charm of summer.
https://www.ebay.com/itm/165656992670
https://www.walmart.com/ip/1146797
https://www.amazon.com/dp/B09723XSVM
And we calculated the execution speed for each URL.
Here is the result:
```bash
$ python3 test_speed_chatgpt_powered_product_page_universal_scraper.py
amazon.com 10.951031041999999
walmart.com 5.695792166000002
ebay.com 7.3706024580000005
```
You should therefore count on roughly 7-8 seconds per request on average. To be confirmed, of course, with a larger sample.
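The timing script itself is not reproduced above, but its core can be as simple as this sketch (the loop and naming are our assumptions; it reuses the scraper class from the tutorial):

```python
import time

urls = [
    'https://www.amazon.com/dp/B09723XSVM',
    'https://www.walmart.com/ip/1146797',
    'https://www.ebay.com/itm/165656992670',
]

pp = pricingPagesGPTScraper()
for url in urls:
    start = time.perf_counter()
    pp.main(url)
    # print the bare domain and the elapsed time in seconds
    print(url.split('/')[2].replace('www.', ''), time.perf_counter() - start)
```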
No, not for the moment. But it will be soon!
And that's the end of this tutorial!
In this tutorial, we saw how to scrape any product page with Python and OpenAI's amazing artificial intelligence, ChatGPT, and how to retrieve the main attributes: the product name, the price, and the product URL.
Let's face it, if you need to scrape the main attributes from 100 product pages, all of which come from different websites, this is a real revolution. It's solid, (rather) reliable, inexpensive and rather fast.
Be careful though: if you need to collect precise, consistently accurate information, at large volume spread over the same site(s), you may end up paying more than expected for unreliable data.
In this case, we invite you to contact us directly here.
Happy scraping!
🦀
Co-founder @ lobstr.io since 2019. Genuine data avid and lowercase aesthetic observer. Ensure you get the hot data you need.