πŸ•ΈοΈ Enhanced Web Data Extractor πŸ”

A powerful and user-friendly web scraping tool built with Python and Streamlit.

🌟 Features

  • πŸš€ Asynchronous web scraping for faster data collection
  • 🌐 Depth-limited crawling to control the scope of extraction
  • πŸ”‘ Keyword filtering to focus on relevant content
  • πŸ“Š Multiple export formats: CSV, Markdown, JSON, and XML
  • πŸ–₯️ Interactive Streamlit UI for easy operation
  • πŸ›‘οΈ Rate limiting to respect server resources
  • πŸ“ˆ Real-time progress tracking

πŸ› οΈ Installation

1. Clone this repository:

   ```bash
   git clone https://github.com/ZeroXClem/enhanced-web-data-extractor.git
   cd enhanced-web-data-extractor
   ```

2. Create a virtual environment (optional but recommended):

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
   ```

3. Install the required packages:

   ```bash
   pip install -r requirements.txt
   ```

πŸš€ Usage

1. Run the Streamlit app:

   ```bash
   streamlit run main.py
   ```

2. Open your web browser and navigate to the URL provided by Streamlit (usually http://localhost:8501).
3. In the Streamlit interface:
   • Enter the base URL you want to scrape
   • Set the maximum number of pages to scrape (1-100)
   • Set the maximum depth for crawling (1-10)
   • (Optional) Enter keywords to filter content
   • Set the rate limit (requests per second)
   • Choose the desired export format(s)
   • Click “Start Scraping”
4. Monitor the progress and download the extracted data when complete.
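The page limit, depth limit, and keyword filter set in the UI combine into a depth-limited crawl. Here is a minimal sketch of that idea; the `crawl` function and the toy link graph are hypothetical, not the tool's actual implementation, and link/text lookup is injected as callables so the example needs no network access.

```python
from collections import deque

def crawl(start_url, get_links, get_text, max_pages=100, max_depth=3, keywords=None):
    """Breadth-first crawl: stop following links past max_depth, keep at
    most max_pages pages, and (optionally) keep only keyword matches."""
    seen = {start_url}
    queue = deque([(start_url, 0)])
    results = []
    while queue and len(results) < max_pages:
        url, depth = queue.popleft()
        text = get_text(url)
        # Keyword filter: keep the page if no keywords were given,
        # or if any keyword appears in its text (case-insensitive).
        if not keywords or any(k.lower() in text.lower() for k in keywords):
            results.append((url, text))
        # Depth limit: only enqueue links from pages above max_depth
        if depth < max_depth:
            for link in get_links(url):
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return results

# Toy site: /home links to /a and /b; only /b mentions "python"
site = {
    "/home": (["/a", "/b"], "welcome page"),
    "/a": ([], "about streamlit"),
    "/b": ([], "python scraping tips"),
}
pages = crawl("/home", lambda u: site[u][0], lambda u: site[u][1],
              max_depth=2, keywords=["python"])
```

Breadth-first order means depth 1 pages are all visited before depth 2, so the page limit favors content closest to the base URL.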

🎯 Use Cases

  • πŸ“š Research: Gather data from academic websites or online journals
  • πŸ’Ό Business Intelligence: Collect product information from e-commerce sites
  • πŸ“° News Aggregation: Compile articles from various news sources
  • 🏒 Competitive Analysis: Extract data from competitor websites
  • πŸ“Š Market Research: Gather consumer reviews and opinions

⚠️ Important Notes

  • This tool is for educational purposes only.
  • Always respect websites’ terms of service and robots.txt files.
  • Be mindful of rate limiting and don’t overload servers with requests.
  • Some websites may have measures in place to prevent scraping.
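Respecting robots.txt, as the notes above advise, can be automated with Python's standard-library `urllib.robotparser`. The rules below are a made-up example, not any real site's policy; in practice you would call `set_url()` and `read()` to fetch the site's actual robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Parse an example robots.txt that blocks one directory
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# Check each URL before scraping it
allowed = rp.can_fetch("*", "https://example.com/public/page")
blocked = rp.can_fetch("*", "https://example.com/private/data")
```

Calling `can_fetch()` before every request is a cheap way to stay within a site's stated crawling policy.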

πŸ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.

🀝 Contributing

Contributions, issues, and feature requests are welcome! Feel free to check the issues page.

πŸ‘¨β€πŸ’» Author

ZeroXClem

Happy Scraping! πŸŽ‰πŸ•·οΈ