Deployment / Scraping
Obtaining content by scraping web pages, then transforming the data and content obtained and delivering it as simpler web APIs.
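As a rough illustration of that scrape, transform, deliver flow, here is a minimal sketch that uses only the Python standard library. The `LinkExtractor` class, the `scrape_to_json` function, and the sample HTML are all hypothetical names and data invented for this example. A real deployment would fetch pages over HTTP and most likely use one of the tools listed below.

```python
import json
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the page title and every hyperlink from an HTML document."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False
        self._current_href = None   # href of the <a> tag currently open, if any
        self._current_text = []     # text fragments seen inside that <a> tag

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:  # ignore anchors that carry no href attribute
                self._current_href = href
                self._current_text = []

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        if self._current_href is not None:
            self._current_text.append(data)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "a" and self._current_href is not None:
            text = "".join(self._current_text).strip()
            self.links.append({"url": self._current_href, "text": text})
            self._current_href = None


def scrape_to_json(html: str) -> str:
    """Scrape (parse), transform (to a dict), and deliver (as JSON text)."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return json.dumps({"title": parser.title.strip(), "links": parser.links}, indent=2)


# Hypothetical page content standing in for a fetched web page.
SAMPLE_HTML = """
<html><head><title>Product list</title></head>
<body>
<a href="/widgets/1">Blue widget</a>
<a href="/widgets/2">Red <b>widget</b></a>
<a name="anchor-without-href">skip me</a>
</body></html>
"""

if __name__ == "__main__":
    print(scrape_to_json(SAMPLE_HTML))
```

The JSON string returned by `scrape_to_json` is the kind of payload a small web API endpoint could serve in place of the original page.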
Tools
- Scrapy - An open source and collaborative framework for extracting the data you need from websites in a fast, simple, yet extensible way.
- Apache Tika - A content analysis toolkit from the Apache Software Foundation. The Apache Tika™ toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF). All of these file types can be parsed through a single interface, making Tika useful for search engine indexing, content analysis, translation, and much more.
Services
- Scrapinghub - Scrapinghub is a company that provides web crawling solutions, including a platform for running crawlers, a tool for building scrapers visually, data feed providers (DaaS) and a consulting team to help startups and enterprises build and maintain their web crawling infrastructures.
- ParseHub - ParseHub is a free web scraping tool. With its web scraper, extracting data is as easy as clicking the data you need.
- Import.io - Import.io is intuitive and highly capable. Point and click to select the data of interest on a web page. It is machine learning based and requires no coding.
Content Harvesting & Extraction
- Concept Extraction
- Summarization
- Entity Extraction
- Taxonomy & Classification
- Relation Extraction
- Article Extraction
- Discussion Extraction
- Date Extraction
- Author Extraction
- Product Extraction
- Related Phrases
- Pagination Extraction
- Dictionaries
Crawling
- Seed URLs
- Pseudo-URLs
Scripting
- Conditional Expressions
- XPath
- RegEx
- Injection
- Timeout
Content Storage & Access
- Content Latest Index
- Historical Index
- Storage
- Search
Automation & Orchestration
- API
- Webhooks
- Command Line Interface
DNS
- Domain Lists
- Domain Metadata
Document Processing
- Feed Detection
- PDF Extraction
- Word Documents
Integrations
- Dropbox
- Amazon S3
- Google Sheets
- Plot.ly
- Silk
- Tableau
International
- Language Detection
- Geo IP Address
Machine Learning
- Semantic Text Analysis
- Semantic Similarity
- Sentiment Analysis
- Emotion Analysis
Media Acquisition
- Image Extraction
- Video Extraction
- Image Tagging
- Image Color Extraction
- Face Detection
- Barcode Recognition
- License Plate Recognition
Structured Data Extraction
- HTML Table Extraction
- Spreadsheet Extraction
- CSV Files
- JSON Files
- Microformats Parsing
- XML Extraction
Utilities
- Proxies
- Cookies
- Headers
- User Agents
- IP Address
- Logging
- Batch Calls
- Scheduler
- Low Latency
Analytics
- Reporting
- URL Metrics
- Spam Score
- Rankings
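The XPath and RegEx features listed above are commonly combined: an XPath expression selects the element, and a regular expression cleans up the text inside it. The sketch below is one possible illustration using only the Python standard library. The markup, the `extract_products` function, and the price pattern are hypothetical. Note that `xml.etree.ElementTree` supports only a limited subset of XPath and requires well-formed markup, so real-world pages usually need a more lenient HTML parser.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup standing in for a scraped page fragment.
SAMPLE = """
<html>
  <body>
    <div class="product">
      <span class="name">Blue widget</span>
      <span class="price">USD 19.99</span>
    </div>
    <div class="product">
      <span class="name">Red widget</span>
      <span class="price">USD 5.00</span>
    </div>
  </body>
</html>
"""

# Regular expression that pulls a decimal amount out of a price label.
PRICE_PATTERN = re.compile(r"(\d+\.\d{2})")


def extract_products(markup: str) -> list:
    """Select product nodes with XPath, then normalise prices with a regex."""
    root = ET.fromstring(markup.strip())
    products = []
    # ElementTree's XPath subset supports attribute predicates like this one.
    for node in root.findall(".//div[@class='product']"):
        name = node.findtext("span[@class='name']")
        raw_price = node.findtext("span[@class='price']", default="")
        match = PRICE_PATTERN.search(raw_price)
        products.append({
            "name": name,
            "price": float(match.group(1)) if match else None,
        })
    return products


if __name__ == "__main__":
    for product in extract_products(SAMPLE):
        print(product)
```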