Enhanced Web Data Extractor
A powerful and user-friendly web scraping tool built with Python and Streamlit.
Features
- Asynchronous web scraping for faster data collection
- Depth-limited crawling to control the scope of extraction
- Keyword filtering to focus on relevant content
- Multiple export formats: CSV, Markdown, JSON, and XML
- Interactive Streamlit UI for easy operation
- Rate limiting to respect server resources
- Real-time progress tracking
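As an illustration of how asynchronous, depth-limited crawling with keyword filtering can fit together, here is a minimal sketch using only the standard library. It is not the project's actual code: the names `crawl`, `LinkParser`, `fake_fetch`, and `SITE` are hypothetical, the fetcher is injected so the example runs without network access, and depth 0 is assumed to mean the base URL itself.

```python
import asyncio
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects link targets and visible text from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links, self.text = [], []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())


async def crawl(base_url, fetch, max_pages=10, max_depth=2, keywords=(), rate=5.0):
    """Breadth-first crawl of one host; `fetch` is a coroutine returning HTML."""
    host = urlparse(base_url).netloc
    seen, level, results, fetched = {base_url}, [base_url], [], 0
    for depth in range(max_depth + 1):
        batch = level[: max_pages - fetched]  # never exceed the page budget
        if not batch:
            break
        tasks = []
        for url in batch:
            tasks.append(asyncio.create_task(fetch(url)))
            await asyncio.sleep(1 / rate)  # space out request starts
        pages = await asyncio.gather(*tasks)
        fetched += len(batch)
        next_level = []
        for url, html in zip(batch, pages):
            parser = LinkParser()
            parser.feed(html)
            text = " ".join(parser.text)
            lowered = text.lower()
            # Keep the page if no keywords were given or any keyword matches.
            if not keywords or any(k.lower() in lowered for k in keywords):
                results.append({"url": url, "depth": depth, "text": text})
            for href in parser.links:
                link = urljoin(url, href).split("#")[0]
                if urlparse(link).netloc == host and link not in seen:
                    seen.add(link)
                    next_level.append(link)
        level = next_level
    return results


# Hypothetical in-memory "website" standing in for real HTTP responses.
SITE = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a> <p>Python home</p>',
    "http://example.test/a": '<p>python tips</p> <a href="/c">C</a>',
    "http://example.test/b": "<p>cooking</p>",
    "http://example.test/c": "<p>deep python</p>",
}


async def fake_fetch(url):
    return SITE[url]


if __name__ == "__main__":
    found = asyncio.run(
        crawl("http://example.test/", fake_fetch, max_depth=1, keywords=["python"], rate=1000)
    )
    for page in found:
        print(page["depth"], page["url"])
```

In a real run, `fetch` would wrap an HTTP client and handle timeouts and errors; the sketch lets a failed fetch propagate as an exception.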
Installation
- Clone this repository:
git clone https://github.com/ZeroXClem/enhanced-web-data-extractor.git
cd enhanced-web-data-extractor
- Create a virtual environment (optional but recommended):
python -m venv venv
source venv/bin/activate # On Windows, use `venv\Scripts\activate`
- Install the required packages:
pip install -r requirements.txt
Usage
- Run the Streamlit app:
streamlit run main.py
- Open your web browser and navigate to the URL provided by Streamlit (usually http://localhost:8501).
- In the Streamlit interface:
  - Enter the base URL you want to scrape
  - Set the maximum number of pages to scrape (1-100)
  - Set the maximum depth for crawling (1-10)
  - (Optional) Enter keywords to filter content
  - Set the rate limit (requests per second)
  - Choose the desired export format(s)
  - Click "Start Scraping"
- Monitor the progress and download the extracted data when complete.
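To make the export step concrete, the following sketch shows one way scraped records could be serialized into the four listed formats with the standard library. The record layout (`url`, `title`, `text`) and the function names are assumptions for illustration, not the tool's actual schema or API.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

FIELDS = ("url", "title", "text")  # hypothetical record layout


def to_csv(records):
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


def to_json(records):
    return json.dumps(records, indent=2, ensure_ascii=False)


def to_markdown(records):
    # One linked heading per page, followed by its text.
    return "\n".join(f"## [{r['title']}]({r['url']})\n\n{r['text']}\n" for r in records)


def to_xml(records):
    root = ET.Element("pages")
    for r in records:
        page = ET.SubElement(root, "page")
        for field in FIELDS:
            ET.SubElement(page, field).text = r[field]
    return ET.tostring(root, encoding="unicode")


EXPORTERS = {"CSV": to_csv, "Markdown": to_markdown, "JSON": to_json, "XML": to_xml}


def export(records, formats):
    """Return a {format name: serialized text} mapping for the chosen formats."""
    return {name: EXPORTERS[name](records) for name in formats}


if __name__ == "__main__":
    sample = [{"url": "http://example.test/", "title": "Home", "text": "a, b"}]
    for name, payload in export(sample, ["CSV", "JSON"]).items():
        print(name)
        print(payload)
```

Keeping the exporters in a dictionary keyed by format name means a multi-select widget in the UI can pass its selection straight to `export`.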
Use Cases
- Research: Gather data from academic websites or online journals
- Business Intelligence: Collect product information from e-commerce sites
- News Aggregation: Compile articles from various news sources
- Competitive Analysis: Extract data from competitor websites
- Market Research: Gather consumer reviews and opinions
Important Notes
- This tool is for educational purposes only.
- Always respect websites' terms of service and robots.txt files.
- Be mindful of rate limiting and don't overload servers with requests.
- Some websites may have measures in place to prevent scraping.
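One way to act on the robots.txt and rate-limiting notes is Python's built-in `urllib.robotparser`. The sketch below parses rules from a list of lines so it runs offline; the user agent `ExampleBot`, the sample rules, and the helper `polite_delay` are hypothetical, and whether the tool itself performs this check is not stated here.

```python
from urllib.robotparser import RobotFileParser

# Sample rules; a real run would instead call parser.set_url(...) and
# parser.read() to download the site's own robots.txt.
ROBOTS_LINES = [
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 2",
]


def polite_delay(parser, user_agent, requested_rate):
    """Seconds between requests: the slower of the user's rate and any Crawl-delay."""
    delay = 1.0 / requested_rate
    crawl_delay = parser.crawl_delay(user_agent)
    if crawl_delay is not None:
        delay = max(delay, float(crawl_delay))
    return delay


parser = RobotFileParser()
parser.parse(ROBOTS_LINES)

if __name__ == "__main__":
    for url in ("http://example.test/public", "http://example.test/private/page"):
        print(url, parser.can_fetch("ExampleBot", url))
    print(polite_delay(parser, "ExampleBot", requested_rate=5))
```

Taking the maximum of the two delays means a user-selected rate can never be faster than what the site asks for.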
License
This project is licensed under the MIT License - see the LICENSE file for details.
Contributing
Contributions, issues, and feature requests are welcome! Feel free to check the issues page.
Author
ZeroXClem
- GitHub: @ZeroXClem
- LinkedIn: @ZeroXClem LinkedIn
Happy Scraping!