Article Extractor
Extract the primary article content, title, byline, publication date, and clean body text from a target webpage URL. Raw HTML input is handled by a separate endpoint (/v1/article/extractFromHtml).
Method: POST
Path: /v1/article/extract
Demo: https://api.gugudata.io/v1/article/extract/demo
OpenAPI: https://gugudata.io/assets/openapi/gugudata.openapi.3.1.json
Request Parameters:
- appkey (string, required): Application key used for request authentication. Supply the value as a query parameter, form field, or multipart field according to the request content type.
- url (string, required): Target webpage URL.
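The request described above can be sketched as follows. This is a minimal illustration, not an official client: the host is assumed from the Demo URL, the form-encoded body is one of the content types the appkey description allows, and the function name build_extract_request is hypothetical. The sketch only builds the request; sending it is left as a comment.

```python
import urllib.parse
import urllib.request

# Assumption: the API host is inferred from the Demo URL in this document.
BASE_URL = "https://api.gugudata.io"
EXTRACT_PATH = "/v1/article/extract"


def build_extract_request(appkey: str, target_url: str) -> urllib.request.Request:
    """Build (but do not send) a form-encoded POST request for the extractor."""
    if not appkey or not target_url:
        raise ValueError("both appkey and url are required")
    # Both documented parameters are sent as form fields.
    body = urllib.parse.urlencode({"appkey": appkey, "url": target_url}).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + EXTRACT_PATH,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


if __name__ == "__main__":
    req = build_extract_request("YOUR_APPKEY", "https://example.com/post")
    print(req.get_method(), req.full_url)
    # To send it: urllib.request.urlopen(req, timeout=30)
```

Keeping request construction separate from sending makes the parameters easy to inspect before any quota is consumed.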
Response Fields:
- DataStatus.StatusCode (integer, required): Application-level status code returned by the current v1 contract.
- DataStatus.StatusDescription (string, required): Application-level status message returned by the current v1 contract.
- DataStatus.ResponseDateTime (string, required): Response timestamp returned by the current service contract.
- DataStatus.DataTotalCount (integer, required): Total number of records that match the request.
- Data.url (string, required): Source URL of the article
- Data.title (string, required): Extracted article title
- Data.description (string, optional): Article description/summary
- Data.links (array, optional): Array of links contained in the article
- Data.image (string, optional): Main article image URL
- Data.content (string, required): Extracted article content (HTML format, with ads and navigation removed)
- Data.author (string, optional): Article author (may be an empty string when unavailable)
- Data.favicon (string, optional): Website favicon URL
- Data.source (string, optional): Source website domain (e.g., sohu.com)
- Data.published (string, optional): Article publication date/time (format: YYYY-MM-DD HH:MM)
- Data.ttr (integer, optional): Estimated reading time (Time to Read, in minutes)
- Data.type (string, optional): Article type (e.g., news, article)
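The response fields above could be read into a typed structure along these lines. This is a sketch under assumptions: the Article class and parse_article function are hypothetical names, the sample payload is invented for illustration (it is not real API output), and only a subset of the documented fields is mapped.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Mapping, Optional


@dataclass
class Article:
    # Required fields per the list above.
    url: str
    title: str
    content: str
    # Optional fields may be absent; author may also be an empty string.
    author: Optional[str] = None
    published: Optional[datetime] = None
    ttr: Optional[int] = None


def parse_article(payload: Mapping[str, Any]) -> Article:
    """Map the Data object of a decoded JSON response onto an Article."""
    data = payload["Data"]
    raw_published = data.get("published")
    # The documented format is YYYY-MM-DD HH:MM.
    published = (
        datetime.strptime(raw_published, "%Y-%m-%d %H:%M") if raw_published else None
    )
    return Article(
        url=data["url"],
        title=data["title"],
        content=data["content"],
        author=data.get("author") or None,  # normalize "" to None
        published=published,
        ttr=data.get("ttr"),
    )


# Hypothetical payload, shaped after the documented field names only.
SAMPLE = {
    "DataStatus": {"StatusCode": 200, "StatusDescription": "OK"},
    "Data": {
        "url": "https://example.com/post",
        "title": "Example title",
        "content": "<p>Body</p>",
        "author": "",
        "published": "2024-05-01 09:30",
        "ttr": 3,
    },
}

if __name__ == "__main__":
    print(parse_article(SAMPLE))
```

Normalizing an empty author string to None lets callers treat "missing" and "empty" the same way.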
HTTP Status Codes:
- 200: Request processed successfully. Some endpoints expose a separate application-level status field in the response body, such as DataStatus.StatusCode.
- 400: Invalid request parameters or request format. Check required fields, data types, and request body format.
- 401: Missing or unknown application key. Provide a valid appkey with the request.
- 403: The application key is recognized but access is not allowed. The key may be expired, inactive, or not permitted for the requested API.
- 429: Request rate or trial usage limit exceeded. Reduce concurrency or retry after the limit window resets.
- 500: Internal service error. Retry later or contact support if the error persists.
- 503: Upstream service unavailable. Retry later; the requested upstream dependency is temporarily unavailable.
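One way a client might act on the HTTP status codes above is to retry only those the list describes as transient (429, 500, 503) and fail fast on the rest. The sketch below is an illustration: the helper names and the exponential backoff schedule (base of 1 second, cap of 30 seconds) are assumptions, not values published by the service.

```python
# Codes the list above describes as worth retrying later.
RETRYABLE_STATUSES = frozenset({429, 500, 503})


def is_retryable(http_status: int) -> bool:
    """Return True when the status list suggests retrying later."""
    return http_status in RETRYABLE_STATUSES


def backoff_delays(max_attempts: int = 4, base: float = 1.0, cap: float = 30.0) -> list:
    """Exponential backoff delays in seconds (assumed schedule, not a service value)."""
    return [min(cap, base * 2 ** attempt) for attempt in range(max_attempts)]


if __name__ == "__main__":
    for status in (400, 401, 403, 429, 500, 503):
        print(status, "retry" if is_retryable(status) else "fix the request")
    print(backoff_delays())
```

Client errors such as 400, 401, and 403 are excluded because repeating the same request cannot succeed until the parameters or the key change.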
Business Status Codes:
- 200 Normal return: Article successfully extracted
- 400 Parameter error: Invalid or missing required parameters (url is required)
- 402 APPKEY error: Check that the APPKEY was obtained from the developer center
- 403 Account in arrears: Payment required to continue using the service
- 429 Request frequency limited: The request rate must not exceed 100 requests per second
- 500 API response error: Internal server error during article extraction. URL may be inaccessible or content format may be unsupported
- 503 Service unavailable: External service temporarily unavailable
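Because the business status travels in the response body (DataStatus.StatusCode) rather than in the HTTP status line, a client would typically check it after decoding the JSON. The sketch below assumes the body has the DataStatus and Data objects listed under Response Fields; the exception class and function name are hypothetical.

```python
from typing import Any, Mapping

# Labels copied from the business status list above.
BUSINESS_STATUS = {
    200: "Normal return",
    400: "Parameter error",
    402: "APPKEY error",
    403: "Account in arrears",
    429: "Request frequency limited",
    500: "API response error",
    503: "Service unavailable",
}


class ArticleExtractError(Exception):
    """Raised when DataStatus.StatusCode is anything other than 200."""

    def __init__(self, code: int, message: str) -> None:
        super().__init__(f"{code} {message}")
        self.code = code


def check_business_status(payload: Mapping[str, Any]) -> Any:
    """Return payload['Data'] on success, otherwise raise ArticleExtractError."""
    status = payload["DataStatus"]
    code = status["StatusCode"]
    if code != 200:
        label = BUSINESS_STATUS.get(code, "Unknown status")
        detail = status.get("StatusDescription", "")
        raise ArticleExtractError(code, f"{label}: {detail}" if detail else label)
    return payload["Data"]


if __name__ == "__main__":
    try:
        check_business_status({"DataStatus": {"StatusCode": 402}, "Data": None})
    except ArticleExtractError as err:
        print(err)
```

Raising a dedicated exception keeps the success path simple: callers receive the Data object directly and handle failures in one place.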
Key Features:
- Extract clean article content from any webpage URL.
- Automatic removal of ads, navigation, and non-content elements.
- Extract article title, content, author, publication date, and metadata.
- Separate endpoint available for HTML string extraction (/v1/article/extractFromHtml).
- High-quality content extraction with intelligent parsing.
- Full API support for HTTPS (TLS v1.0 / v1.1 / v1.2 / v1.3).
- Fully compatible with Apple ATS.
- Nationwide multi-node CDN deployment.
- Fast responses, with API traffic load-balanced across multiple servers.
Details: https://gugudata.io/details/article-extract