This Xiaohongshu image-and-text layout AI skill has found a way to generate posts that sidestep AI labeling.

In February 2026, Xiaohongshu issued an announcement requiring all AI-generated content to be proactively labeled, with unlabeled content facing distribution restrictions. Three months later, an open-source project called guizang-social-card-skill appeared on GitHub, designed specifically to generate 3:4 image-and-text cards for Xiaohongshu and covers for WeChat official accounts. Its technical approach was unusual: it used no AI model to generate image pixels. The entire card is rendered with HTML and CSS, and the photos come from libraries of real photographs such as Unsplash. The output is not an "AI-generated image" but a rasterized screenshot of a webpage produced by a browser engine.

This choice responds to a specific change. Since 2026, Xiaohongshu has deployed an audio-visual recognition model that analyzes the pixel distribution patterns of images and the features of audio to identify AIGC content. During the same period, it took action against more than 800,000 AI-managed accounts and nearly 150,000 AI-generated fake notes. For creators who need to produce image-and-text posts frequently, the probability that images generated by Midjourney or Canva AI will be detected and labeled keeps rising. Master Zang's Skill takes another path: the AI makes the layout decisions, and the final pixel values are left to the rendering engine and to libraries of real photographs.

This is a deliberate technical detour. How far it can go depends on how broadly the platform chooses to define "AI-generated synthetic content."

Across the 28 layout skeletons, the AI is responsible for the layout logic, not the drawing.

Master Zang, who goes by Guizang, previously released guizang-ppt-skill, an AI tool also built for image-and-text layout. This time, social-card-skill is more focused: it targets 3:4 image-and-text cards for Xiaohongshu and 1:1 and 21:9 covers for WeChat official accounts, with output resolutions of 1080×1440, 1080×1080, and 2100×900 respectively.
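The pairing of ratios and resolutions above is simple arithmetic, and a short check makes it explicit. The snippet below is only an illustrative sketch: the `SIZES` table and the helper names are assumptions made for this example, not part of the skill's code.

```python
from fractions import Fraction

# Output sizes quoted above, keyed by the aspect ratio each one should match.
SIZES = {
    "3:4": (1080, 1440),   # Xiaohongshu card
    "1:1": (1080, 1080),   # WeChat official account square cover
    "21:9": (2100, 900),   # WeChat official account wide cover
}


def aspect_ratio(width: int, height: int) -> str:
    """Reduce width:height to lowest terms, e.g. 1080x1440 -> '3:4'."""
    ratio = Fraction(width, height)
    return f"{ratio.numerator}:{ratio.denominator}"


def matches(label: str, width: int, height: int) -> bool:
    """Check that a 'W:H' label describes the same ratio as the pixel size."""
    w, h = (int(part) for part in label.split(":"))
    return Fraction(w, h) == Fraction(width, height)


if __name__ == "__main__":
    for label, (width, height) in SIZES.items():
        print(label, (width, height), matches(label, width, height))
```

Note that 2100×900 reduces to 7:3, which is the same ratio as 21:9 written in lowest terms.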

Architecturally, the skill has 28 built-in layout skeletons divided into two visual systems, Editorial (magazine style, 16 layouts) and Swiss (Swiss International Style, 12 layouts), along with 10 preset theme colors. After the user provides a destination, itinerary, or note theme, the AI selects a suitable layout skeleton, decides where the text goes, works out the map annotation parameters, and writes all of these design decisions into HTML and CSS. Playwright, a browser automation tool, then takes over and renders the pages in a browser, outputting PNGs page by page.
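To make the split between "design decisions" and "pixels" concrete, here is a minimal sketch of the composition step, assuming a layout decision is just a small record that gets written into an HTML and CSS page. The `LayoutDecision` fields, the `editorial-cover` skeleton name, and the template are all hypothetical; the skill's real skeletons are not reproduced here, and the browser rendering step is left out.

```python
from dataclasses import dataclass
from html import escape
from string import Template

# Hypothetical page template: the decision fills in sizes, colors and text.
PAGE = Template("""<!doctype html>
<html>
<head>
<meta charset="utf-8">
<style>
  body { margin: 0; }
  .card { width: ${width}px; height: ${height}px; background: ${theme_color}; }
  .title { font-size: ${title_px}px; }
</style>
</head>
<body>
  <section class="card ${skeleton}">
    <h1 class="title">${title}</h1>
  </section>
</body>
</html>
""")


@dataclass
class LayoutDecision:
    skeleton: str          # hypothetical skeleton id chosen by the AI
    theme_color: str       # one of the preset theme colors, as a CSS color
    title: str             # user-supplied text, escaped before insertion
    title_px: int = 96     # assumed default, not taken from the project
    width: int = 1080
    height: int = 1440


def compose(decision: LayoutDecision) -> str:
    """Write a layout decision into a standalone HTML page."""
    return PAGE.substitute(
        skeleton=escape(decision.skeleton, quote=True),
        theme_color=decision.theme_color,
        title=escape(decision.title),
        title_px=decision.title_px,
        width=decision.width,
        height=decision.height,
    )


if __name__ == "__main__":
    print(compose(LayoutDecision("editorial-cover", "#f4efe6", "Kyoto in 3 days")))
```

In a setup like this, the model only ever produces values for the record; every pixel in the final PNG comes from the browser that later screenshots the page.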

One component that is particularly useful for travel bloggers is the map module. It uses MapLibre to load real tiles from OpenStreetMap and supports multiple location markers with connecting lines between them. Users simply provide a city or attraction name, and the AI generates an annotated base map and embeds it into the layout. The accompanying image-sourcing workflow has a clear priority order: real photos provided by the user come first; if none are available, images are retrieved automatically in the order Unsplash → Pexels → Flickr CC → Wallhaven.
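The priority order described above is a plain fallback chain. The sketch below shows one way to express it, assuming each source is wrapped in a lookup function that returns an image URL or path, or `None` when it has nothing. The provider keys and lookup functions are hypothetical stand-ins; no real Unsplash, Pexels, Flickr, or Wallhaven API calls are made.

```python
from typing import Callable, Dict, Optional, Tuple

# Priority order as described above; the keys are labels chosen for this sketch.
PRIORITY = ["user", "unsplash", "pexels", "flickr_cc", "wallhaven"]

Lookup = Callable[[str], Optional[str]]


def pick_image(query: str, providers: Dict[str, Lookup]) -> Optional[Tuple[str, str]]:
    """Return (source, image) from the first provider that has a result."""
    for name in PRIORITY:
        lookup = providers.get(name)
        if lookup is None:
            continue  # this source is not configured
        image = lookup(query)
        if image:
            return name, image
    return None  # every source came up empty


if __name__ == "__main__":
    # Stub lookups standing in for real photo folders and stock libraries.
    providers = {
        "user": lambda q: None,
        "unsplash": lambda q: None,
        "pexels": lambda q: f"pexels-result-for-{q}.jpg",
        "wallhaven": lambda q: f"wallhaven-result-for-{q}.jpg",
    }
    print(pick_image("kyoto", providers))
```

Because the user's own photos sit at the front of the list, any stock source is consulted only when the creator has nothing suitable.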

The entire process runs in seven steps: Intake → Style & Theme → Layout Selection → Asset Prep → Compose & Render → Deliver & Review → Iterate. Each step is recorded in the .poster file in the task directory. For batch generation, `node render.mjs` is run and Playwright renders the pages one by one. A separate validation script, `validate-social-deck.mjs`, measures DOM elements in a real browser environment to detect layout issues such as text overflow, font sizes exceeding the limit, and colliding footer elements.
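The three kinds of checks named above (overflow, oversized fonts, footer collisions) reduce to simple geometry once element boxes have been measured. The sketch below illustrates that logic on hand-written boxes. It is not the project's `validate-social-deck.mjs`: the `Box` record, the `footer` naming convention, and the 160px font limit are assumptions for this example, and the real script measures the DOM in a browser rather than taking boxes as input.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Box:
    name: str
    x: float
    y: float
    width: float
    height: float
    font_px: float = 0.0  # 0 means "not a text element"


def overflows(box: Box, page_w: float, page_h: float) -> bool:
    """True if any part of the box lies outside the page."""
    return (box.x < 0 or box.y < 0
            or box.x + box.width > page_w
            or box.y + box.height > page_h)


def collides(a: Box, b: Box) -> bool:
    """True if two axis-aligned boxes overlap."""
    return (a.x < b.x + b.width and b.x < a.x + a.width
            and a.y < b.y + b.height and b.y < a.y + a.height)


def validate(boxes: List[Box], page_w: float = 1080, page_h: float = 1440,
             max_font_px: float = 160) -> List[str]:
    """Collect human-readable layout issues for one page."""
    issues = []
    for box in boxes:
        if overflows(box, page_w, page_h):
            issues.append(f"{box.name}: overflows page")
        if box.font_px > max_font_px:
            issues.append(f"{box.name}: font size exceeds {max_font_px}px")
    footers = [b for b in boxes if b.name.startswith("footer")]
    for i, a in enumerate(footers):
        for b in footers[i + 1:]:
            if collides(a, b):
                issues.append(f"{a.name} collides with {b.name}")
    return issues


if __name__ == "__main__":
    page = [
        Box("title", 60, 80, 1100, 200, font_px=120),
        Box("footer-left", 60, 1360, 500, 40, font_px=24),
        Box("footer-right", 520, 1360, 500, 40, font_px=24),
    ]
    for issue in validate(page):
        print(issue)
```

A page passes only when the returned list is empty, which is what allows a batch run to stop before exporting a broken card.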

The design goal of this mechanism is clear: to be as precise and controllable as print typesetting software, rather than as free but unpredictable as a diffusion model. The trade-off is that creative freedom is confined to 28 layouts. For creators who rely on a personal photographic style, hand-drawn elements, or irregular collages, these layout skeletons bring design constraints rather than efficiency gains.

As for the barrier to entry, the CLI version requires Playwright and Node.js to be installed, along with access to Claude Code or Codex. There is also a web-based version at xiaohongshu.guizang.ai for non-developers, but whether its functionality matches the CLI version has not been publicly documented. Several posts from the developer on X and the frequently updated README indicate that the project is still iterating rapidly.

Pixels do not come from the generative model, but compliance does not equate to long-term security.

Based on publicly available information, Xiaohongshu's AI content detection relies primarily on an audio-visual recognition model. This model analyzes the pixel distribution patterns of images to determine whether content was produced by a generative AI model. Diffusion models and GANs leave specific statistical features at the pixel level when generating images, and these features differ from the natural lighting, lens distortion, and noise patterns captured by camera sensors. The recognition model is trained precisely to pick up these statistical inconsistencies.

The avoidance logic of Master Zang's Skill rests on one key distinction: the pixels of its output images do not come from any generative model. The HTML rendering engine rasterizes CSS styles, so the pixel distribution of the result more closely resembles a browser screenshot or the output of desktop publishing software. The photos come from real photographs in libraries such as Unsplash; they were taken with a camera and post-processed by hand, and carry no traces of a generative model.

However, this distinction only holds if the platform's definition of "AI-generated synthetic content" is limited to "pixels generated by an AI model." Xiaohongshu's official announcement uses the phrase "AI-generated synthetic content," which is not narrow in scope. If the platform expands the definition to cover "rendered output of AI-assisted design programs," or adds the browser rendering characteristics of HTML-rasterized images to the recognition model's training set, the current technical advantage of this approach will disappear.

The platform has the technological foundation and governance motivation to expand its definition. The audio-visual recognition model itself is continuously iterating. If a large number of comparative samples of HTML rendered images and AI-generated images are included in the training data, the model can learn to distinguish between "subpixel anti-aliasing features of browser font rendering" and "irregular pixel blocks in GAN text generation." Currently, there is no publicly available information indicating that Xiaohongshu has started training in this direction, but from the perspective of the model's capability boundaries, such expansion is technically feasible.

A more pressing issue is the compliance requirements related to mini-program hosting. Currently, there is no official documentation indicating that this skill has integrated a model registration number or completed relevant compliance registration. If platforms add traceability requirements for the image generation toolchain in their content review process, the lack of registration information could become a new point of failure.

API template engines, platform customization tools, and HTML rendering are diverging into three different paths.

The tools on the market for generating social media images are diverging into three technical paths, each with a different structure of content moderation risk.

AI models directly generate images. This approach is represented by Canva AI's Magic Design feature, released in April 2026, which generates design drafts containing AI visual elements directly from text prompts. Images generated by models like Midjourney and DALL·E also fall into this category. The problem is clear: these images are the primary detection targets of audio-visual recognition models. Canva's approach is to encourage transparent labeling rather than to avoid detection. Whether labeling a Xiaohongshu post as containing AI-generated images lowers its recommendation weight has not been publicly disclosed, but the platform's rule that unlabeled AI content faces distribution restrictions is established policy. With each update to a diffusion model, its pixel-level statistical features may change, and the detection model will iterate in response, so creators are facing a constantly moving target.

API template engine rendering. Bannerbear is a typical example of this approach. Users create templates in a designer, modify layer variables by passing JSON data through a REST API, and the server renders and outputs a PNG or JPG. Its core is also "procedural rendering" rather than "model-generated pixels," and the output carries no traces of a diffusion model. The difference between Bannerbear and Master Zang's Skill is that Bannerbear's templates are designed by hand and AI takes no part in layout decisions, whereas Master Zang's Skill lets Claude read and write the HTML directly and hands layout selection to the AI. The risk of the Bannerbear approach lies in another dimension: when a large number of accounts produce posts with the same template, the same color scheme, and the same font, the platform's pattern recognition for "procedural mass production" can be triggered even though no individual image is AI-generated. The triggering conditions of anti-spam rules are not identical to those of AI detection, but for creators operating accounts in batches the result is the same: limited distribution.

Platform-customized generation. Pin Generator is designed specifically for Pinterest and automatically generates pin images that match the platform's algorithmic preferences. The core of this approach is not circumvention but complete adaptation: size, visual style, and posting schedule all align with platform guidelines. The advantage is the lowest moderation risk, but the disadvantage is just as obvious: the tool's capabilities are tied to platform rules, so when Pinterest adjusts its algorithm or restricts third-party API calls, the tool can stop working entirely. Compared with Master Zang's Skill, Pin Generator is a platform-specific tool while the Skill is a cross-platform solution. Platform-specific tools are safer but more fragile, while cross-platform tools are more flexible but more complex. This is a recurring trade-off in the field of AI tools.

The risk structures of the three approaches differ. AI-generated images offer the most freedom, but every model update means facing a new round of detection. Template engines are the most stable but may be wrongly penalized by anti-spam rules. HTML rendering falls somewhere in between: the layout is flexibly controlled by AI, while the pixels come from the browser and from real photographs. This avoids detection at the level of "AI-generated pixels," but it cannot withstand an expansion of the platform's rules at the semantic level.

The upper limit of a layout system lies not in the code, but in the content type.

The 28 layout skeletons cover two mainstream visual systems: magazine style and Swiss style. This system is a perfect match for travel bloggers who need to display maps, routes, timelines, and multi-day itineraries. Map annotations and itinerary connections are the core information of these notes, and the layout skeletons structure the information while maintaining a professional look.

But Xiaohongshu's content ecosystem is far richer than just travel guides. Fashion tips rely on personal photography style and color palettes, beauty reviews require high-resolution macro photos and product comparison images, and lifestyle content heavily utilizes multi-image collages and handwritten annotations. The "layout" of these content types is not a structured presentation of information, but rather an expression of personal aesthetics and emotions. In this context, the 28 layout templates are not tools, but constraints.

The technical limitations are also real. The skill currently supports three sizes: 1080×1440 (Xiaohongshu 3:4), 2100×900 (WeChat official account 21:9), and 1080×1080 (WeChat official account 1:1). It does not support 9:16 vertical covers for Douyin or 16:9 horizontal covers for Bilibili. The image sourcing relies mainly on Unsplash and Pexels, whose material leans toward high-quality photography suited to travel, landscape, and urban architecture. Coverage of the material that niche verticals use most often, such as food close-ups, cosmetics shots, and individual clothing items, is limited in these libraries. Giving priority to user-provided photos partially alleviates this problem, provided that creators have accumulated enough real photos of their own.

The validation mechanism is a double-edged sword. `validate-social-deck.mjs` can catch layout issues before the final images are output, so that a batch of 100 renders can complete without layout errors. This guarantees efficiency in operational scenarios that require dozens of images a day. However, it also means that any design that does not conform to the preset layout rules will be rejected by the script. Creators who want to add a slanted text decoration or custom margins to a standard layout cannot simply drag and adjust as they would in Canva; they need to edit the HTML and CSS source directly.

The local deployment barrier is another point of stratification. Creators who can run Playwright and Node scripts can dig into the layout skeletons and rendering scripts and customize them. Most Xiaohongshu bloggers, however, can only use whatever subset of functionality the web interface exposes. The actual value these two types of users get from the skill differs greatly. The core users of an open-source project like this are creators and developers with technical backgrounds who are willing to experiment, not ordinary content producers who want "one-click image generation."

There is no single answer, but the divergence in technological approaches speaks volumes.

A Xiaohongshu travel blogger faces three choices: use Midjourney to generate illustrated itinerary maps and risk being flagged and penalized; set up templates in Bannerbear and feed them data in batches every day, risking anti-spam measures because of template homogenization; or use Master Zang's Skill to let AI select a layout and render the image with HTML, risking an expansion of the platform's definition of "synthetic content." There are no safe options, only combinations of different risk structures.

This situation itself conveys a message: the iterative battle between platforms and AI tools has begun. Each time a platform updates its detection model, the technological advantages of a batch of tools come to an end. Each time a new tool finds a workaround, the platform adjusts its strategy. This is not a process that will converge to a stable state. The lifespan of the HTML rendering solution depends on whether Xiaohongshu's audio-visual recognition model continues to focus on "diffusion model pixel features" or expands to "all non-native photographic pixels."

For content creators, the distinction between "AI-assisted" and "AI-replaced" content is becoming practically significant. Platforms have made their stance clear: they encourage AI as a creative amplifier but oppose using it to replace humans in low-quality mass production. In Master Zang's Skill, the AI makes layout decisions rather than generating content; the photos are real shots, and the layouts are frameworks designed in advance by a human designer. This falls squarely into the "AI-assisted" category. Text-to-image content produced entirely by generative models is what platforms are explicitly targeting.

Whether this distinction will become an operational standard for platform review remains uncertain. However, tool developers are already responding to this definition with technological choices.