Here is a Python script that does the above:
import sys
from bs4 import BeautifulSoup
# Read the HTML file
html_file = sys.argv[1]
with open(html_file, "r") as f:
    html = f.read()
# Parse the HTML file using BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
# 1. Extract out a div tag with a class that starts with the string "ThreadLayout__NodeWrapper-".
# Let's call this the main tag.
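main = soup.find("div", class_=lambda c: c and c.startswith("ThreadLayout__NodeWrapper-"))
# Stop early if the main tag is missing; the steps below would otherwise fail on None
if main is None:
    sys.exit("Error: no div with a class starting with 'ThreadLayout__NodeWrapper-' was found")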
# 2. Inside the "main" tag, delete the divs that have a class that starts with one of the following strings:
# "Pagination__PaginationWrapper-", "ConversationItem__ActionButtons-", "CodeSnippet__ActionBar-" or "Thread__PositionForm-".
for div in main.find_all("div", class_=lambda c: c and c.startswith((
        "Pagination__PaginationWrapper-",
        "ConversationItem__ActionButtons-",
        "CodeSnippet__ActionBar-",
        "Thread__PositionForm-"))):
    div.decompose()
# 3. For any div with a class starting with the string "Avatar-", remove the style attribute if one exists.
for div in main.find_all("div", class_=lambda c: c and c.startswith("Avatar-")):
    div.attrs.pop("style", None)
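# 4. For any "p" tags in the document, apply the style attribute with the value "white-space: pre-wrap;" and
# strip leading whitespace from the text found in any child nodes (recursive), if they have any text.
for p in soup.find_all("p"):
    p.attrs["style"] = "white-space: pre-wrap;"
    # string=True matches every text node under the tag (text=True is the older spelling)
    for text in p.find_all(string=True):
        # lstrip() removes only leading whitespace; strip() would also remove trailing whitespace
        text.replace_with(text.lstrip())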
# 5. Wrap the main tag in a standard HTML5 document. Include a link at the top (before the main tag),
# with the text "< Back" linking to the URL "/chatgpt"
html_template = """
<!DOCTYPE html>
<html>
<head>
<title>Modified HTML Document</title>
</head>
<body>
<a href="/chatgpt">&lt; Back</a>
{}
</body>
</html>
"""
# 6. Output the HTML document to stdout without prettifying it.
print(html_template.format(main))
To run the script, you need the beautifulsoup4 library installed. You can install it using pip install beautifulsoup4.
To run the script, you can use the following command:
python script.py path/to/file.html
The output will be printed to the standard output.
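If it helps to see the string handling in isolation, here is a minimal sketch that uses only the standard library. The helper name has_prefix and the sample class names are made up for illustration; the prefixes are the ones from step 2. It shows that str.startswith accepts a tuple of prefixes, why the lambda in the script guards with "c and ..." (the value can be None when a tag has no class), and how lstrip() differs from strip() for step 4.

```python
# Prefixes copied from step 2 of the script above.
REMOVE_PREFIXES = (
    "Pagination__PaginationWrapper-",
    "ConversationItem__ActionButtons-",
    "CodeSnippet__ActionBar-",
    "Thread__PositionForm-",
)


def has_prefix(class_name, prefixes=REMOVE_PREFIXES):
    """Hypothetical helper: True if class_name starts with any of the prefixes."""
    # Guard against None or an empty string before calling startswith,
    # mirroring the "c and c.startswith(...)" pattern in the script.
    return bool(class_name) and class_name.startswith(prefixes)


# The class names below are invented samples, not taken from a real page.
print(has_prefix("Pagination__PaginationWrapper-abc123"))  # True
print(has_prefix("Avatar-abc123"))                         # False
print(has_prefix(None))                                    # False

# Step 4 asks for leading whitespace only, which is what lstrip() removes.
print(repr("  hello  ".lstrip()))  # 'hello  '
print(repr("  hello  ".strip()))   # 'hello'
```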
Here is a bash script that does the above:
#!/usr/bin/env bash
# Check if the number of arguments is correct
if [ $# -ne 2 ]; then
    echo "Usage: $0 input_folder output_folder"
    exit 1
fi
# Set the input and output folders
input_folder=$1
output_folder=$2
# Check if the input folder exists and is a directory
if [ ! -d "$input_folder" ]; then
    echo "Error: $input_folder does not exist or is not a directory"
    exit 1
fi
# Check if the output folder exists and is a directory
if [ ! -d "$output_folder" ]; then
    echo "Error: $output_folder does not exist or is not a directory"
    exit 1
fi
# Create a temporary directory
tmp_dir=$(mktemp -d)
# Process all HTML files in the input folder
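for file in "$input_folder"/*.html; do
    # Skip the literal pattern that the glob leaves behind when nothing matches
    [ -e "$file" ] || continue
    # Get the base filename without the extension
    filename=$(basename "$file" .html)
    # Process the HTML file using the chatgpt_parser.py script
    python chatgpt_parser.py "$file" > "$tmp_dir/$filename.html"
done
# Move the processed files from the temporary directory to the output directory
for processed in "$tmp_dir"/*.html; do
    # Nothing to move if no files were processed
    [ -e "$processed" ] || continue
    mv "$processed" "$output_folder"
done
# Delete the temporary directory
rmdir "$tmp_dir"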
To use the script, you need the chatgpt_parser.py script in the directory you run this script from, because it is invoked by a relative path. You also need the python and mktemp commands in your system's PATH.
To run the script, you can use the following command:
./script.sh input_folder output_folder
This will process all the HTML files in the input_folder and place the processed files in the output_folder.
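If you would rather drive the batch step from Python instead of bash, the same loop could be sketched roughly as below. This is only an illustration under a few assumptions: the function name process_folder and its parser_cmd parameter are hypothetical, chatgpt_parser.py is assumed to be in the current working directory, and the sketch writes results straight into the output folder rather than staging them in a temporary directory as the bash script does.

```python
import subprocess
import sys
from pathlib import Path


def process_folder(input_folder, output_folder, parser_cmd=None):
    """Run the parser on every *.html file in input_folder.

    Each result is written to a file of the same name in output_folder.
    Returns the list of output paths. parser_cmd is a hypothetical
    parameter: the command prefix to which the input path is appended.
    """
    if parser_cmd is None:
        # Assumption: chatgpt_parser.py is in the current working directory.
        parser_cmd = [sys.executable, "chatgpt_parser.py"]

    src = Path(input_folder)
    dst = Path(output_folder)
    # Same checks as the bash script: both folders must already exist.
    for folder in (src, dst):
        if not folder.is_dir():
            raise NotADirectoryError(f"{folder} does not exist or is not a directory")

    written = []
    for html_file in sorted(src.glob("*.html")):
        # Capture the parser's stdout; check=True raises if it exits non-zero.
        result = subprocess.run(
            [*parser_cmd, str(html_file)],
            check=True,
            capture_output=True,
        )
        out_path = dst / html_file.name
        out_path.write_bytes(result.stdout)
        written.append(out_path)
    return written
```

Calling process_folder("input_folder", "output_folder") would then play the role of ./script.sh input_folder output_folder. Unlike the bash version, an empty input folder simply returns an empty list, and a failing parser run stops the loop with an exception.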