Webpage Readable Content Extraction
Extract cleaned, reader-friendly article content from a webpage URL or a raw HTML payload.
Method: POST
Path: /v1/websitetools/readability
Demo: https://api.gugudata.io/v1/websitetools/readability/demo
OpenAPI: https://gugudata.io/assets/openapi/gugudata.openapi.3.1.json
Request Parameters:
- appkey (string, required): Application key used for request authentication. Supply the value as a query parameter, form field, or multipart field according to the request content type.
- html (string, optional): Raw HTML content. Supply either html or url.
- url (string, optional): Target webpage URL. Supply either url or html.
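A minimal sketch of building the request described above, using only the Python standard library. The host is an assumption inferred from the Demo URL, the helper name build_readability_request is hypothetical, and sending appkey as a form field is one of the options the parameter list allows. Rejecting a call that supplies both html and url is a stricter reading than the reference states.

```python
import urllib.parse
import urllib.request

# Assumption: host inferred from the Demo URL; the path is from the reference.
ENDPOINT = "https://api.gugudata.io/v1/websitetools/readability"


def build_readability_request(appkey, url=None, html=None):
    """Build (but do not send) a form-encoded POST request."""
    # Stricter than the reference: require exactly one of url / html.
    if (url is None) == (html is None):
        raise ValueError("supply exactly one of url or html")
    fields = {"appkey": appkey}
    if url is not None:
        fields["url"] = url
    else:
        fields["html"] = html
    body = urllib.parse.urlencode(fields).encode("utf-8")
    return urllib.request.Request(
        ENDPOINT,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )
```

The returned object can be passed to urllib.request.urlopen to send it; the sketch stops before the network call so it can be inspected or tested offline.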
Response Fields:
- DataStatus.RequestParameter (string, required): Normalized request parameters echoed by the service. Sensitive credentials are omitted when available.
- DataStatus.StatusCode (integer, required): Application-level status code returned by the current v1 contract.
- DataStatus.StatusDescription (string, required): Application-level status message returned by the current v1 contract.
- DataStatus.ResponseDateTime (string, required): Response timestamp returned by the current service contract.
- DataStatus.DataTotalCount (integer, required): Total number of records that match the request.
- Data.Title (string, required): Article title
- Data.Byline (string, required): Article author
- Data.Dir (string, required): Article text direction
- Data.Lang (string, required): Article language
- Data.Content (string, required): Article content as HTML
- Data.TextContent (string, required): Article content (without HTML tags, divided by paragraphs)
- Data.Length (integer, required): Article length
- Data.Excerpt (string, required): Article excerpt
- Data.SiteName (string, required): Website name
- Data.PublishedTime (array, required): Article publication time
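The fields above could be read from a JSON response roughly as follows. This is a sketch under assumptions: the key casing is taken from the field list, the Article class and parse_readability_response are hypothetical names, and the sample payload is hand-written for illustration (it contains only a subset of fields and was not captured from the service).

```python
import json
from dataclasses import dataclass


@dataclass
class Article:
    title: str
    byline: str
    text: str
    length: int


def parse_readability_response(raw):
    """Parse a response body and return an Article, or raise on a non-200 status."""
    payload = json.loads(raw)
    status = payload["DataStatus"]
    if status["StatusCode"] != 200:
        raise RuntimeError(f"{status['StatusCode']}: {status['StatusDescription']}")
    data = payload["Data"]
    return Article(
        title=data["Title"],
        byline=data["Byline"],
        text=data["TextContent"],
        length=data["Length"],
    )


# Hypothetical payload for illustration only; values are invented.
sample = json.dumps({
    "DataStatus": {"StatusCode": 200, "StatusDescription": "Normal return"},
    "Data": {
        "Title": "Example Article",
        "Byline": "A. Writer",
        "TextContent": "Hello world",
        "Length": 11,
    },
})
article = parse_readability_response(sample)
```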
HTTP Status Codes:
- 200: Request processed successfully. Some endpoints expose a separate application-level status field in the response body, such as dataStatus.statusCode.
- 400: Invalid request parameters or request format. Check required fields, data types, and request body format.
- 401: Missing or unknown application key. Provide a valid appkey with the request.
- 403: The application key is recognized but access is not allowed. The key may be expired, inactive, or not permitted for the requested API.
- 429: Request rate or trial usage limit exceeded. Reduce concurrency or retry after the limit window resets.
- 500: Internal service error. Retry later or contact support if the error persists.
- 503: Upstream service unavailable. Retry later; the requested upstream dependency is temporarily unavailable.
Business Status Codes:
- 200 Normal return: No additional remark.
- 400 Parameter error: No additional remark.
- 429 Request frequency limited: The limit is 100 requests per second.
- 403 Account in arrears: Renew before the order expires; watch for the expiration reminders sent by SMS.
- 402 APPKEY error: Check that the APPKEY you passed was obtained from the developer center.
- 500 API response error: No additional remark.
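Because a response carries both an HTTP status and a business status, a client may want to check them in that order. The sketch below is one possible policy, not the service's prescribed handling: the function name next_action and the three outcome labels are hypothetical, the retryable HTTP codes follow the "retry" wording in the HTTP Status Codes list, and treating business codes 429 and 500 as retryable is an assumption.

```python
# HTTP codes whose descriptions in the reference suggest retrying later.
HTTP_RETRYABLE = {429, 500, 503}

# Business codes and their names, copied from the list above.
BUSINESS_MEANING = {
    200: "Normal return",
    400: "Parameter error",
    402: "APPKEY error",
    403: "Account in arrears",
    429: "Request frequency limited",
    500: "API response error",
}

# Assumption: the same retry policy applies to these business codes.
BUSINESS_RETRYABLE = {429, 500}


def next_action(http_status, business_code=None):
    """Return 'ok', 'retry', or 'fail' for a single response."""
    # Check the transport-level status first.
    if http_status != 200:
        return "retry" if http_status in HTTP_RETRYABLE else "fail"
    # Then the application-level status, if the body carried one.
    if business_code is None or business_code == 200:
        return "ok"
    return "retry" if business_code in BUSINESS_RETRYABLE else "fail"
```

A caller would typically pair "retry" with a delay, since the 429 entry advises waiting for the limit window to reset.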
Key Features:
- Intelligently extracts readable content from webpages.
- Provides HTML code of the webpage's readable content.
- Accepts either raw webpage HTML or a webpage URL as input.
- Extracts the article title, author, text direction, language, HTML content, plain-text content (HTML tags removed, divided by paragraphs), article length, excerpt, website name, and publication time.
- Parses pages within seconds and supports high concurrency.
- Supports HTTPS (TLS v1.0 / v1.1 / v1.2 / v1.3) for all interfaces.
- Fully compatible with Apple ATS.
- Nationwide multi-node CDN deployment.
- Fast responses, with API traffic load-balanced across multiple servers.
Details:
https://gugudata.io/details/readability