Model Details

This model is a mixed int4 model of zai-org/GLM-4.6 with group_size 128 and symmetric quantization, generated by intel/auto-round via OPT RTN (with algorithm tuning). Non-expert layers fall back to 8 bits. See the section "Generate the model" for details. Please follow the license of the original model.
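To make "symmetric quantization with group_size 128" concrete, here is a minimal pure-Python sketch of group-wise round-to-nearest quantization. It is an illustration under stated assumptions, not AutoRound's implementation: the function names are hypothetical, the scale is taken as max|w| / 7, and the tuning that AutoRound applies is not modeled.

```python
import math


def quantize_group_symmetric(weights, bits=4):
    """Quantize one group with a single scale and no zero point (symmetric)."""
    qmax = 2 ** (bits - 1) - 1      # 7 for 4 bits
    qmin = -(2 ** (bits - 1))       # -8 for 4 bits
    max_abs = max(abs(w) for w in weights)
    # Assumption: scale maps the largest magnitude onto qmax.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [min(qmax, max(qmin, round(w / scale))) for w in weights]
    return q, scale


def quantize_row(row, group_size=128, bits=4):
    """Split a weight row into consecutive groups and quantize each one."""
    return [
        quantize_group_symmetric(row[start:start + group_size], bits)
        for start in range(0, len(row), group_size)
    ]


# Hypothetical weight row: the second half has 10x larger magnitudes,
# so each group of 128 values ends up with its own scale.
row = [math.sin(i) * (1.0 if i < 128 else 10.0) for i in range(256)]
groups = quantize_row(row)
for index, (q, scale) in enumerate(groups):
    print(f"group {index}: scale={scale:.4f}, min={min(q)}, max={max(q)}")
```

Because each group of 128 weights carries its own scale, a few large weights in one group do not reduce the precision of the others.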

How To Use

INT4 Inference

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "Intel/GLM-4.6-int4-mixed-AutoRound"
messages = [{"role": "user", "content": "Give me a short introduction to large language model."}]
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
)
model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=MODEL_PATH,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
inputs = inputs.to(model.device)
inputs.pop("token_type_ids", None)  # drop token_type_ids if the tokenizer returned it
generated_ids = model.generate(**inputs, max_new_tokens=512, do_sample=False)
output_text = tokenizer.decode(generated_ids[0][inputs.input_ids.shape[1] :])
print(output_text)
"""
<think>1.  **Deconstruct the User's Request:**
    *   **Core Subject:** Large Language Model (LLM).
    *   **Task:** "Give me a short introduction."
    *   **Implicit Needs:**
        *   What is it? (Definition)
        *   How does it work (in simple terms)? (Mechanism)
        *   What can it do? (Applications/Examples)
        *   Why is it important? (Significance)
        *   Keep it "short" and accessible. Avoid overly technical jargon.

2.  **Brainstorming Key Concepts & Keywords:**
    *   **What is it?** AI, deep learning, neural network, text, language, human-like, prediction, probability, next word.
    *   **How does it work?** Training, massive data (internet, books), parameters (billions), patterns, grammar, context, relationships, transformer architecture (maybe too technical for a *short* intro, but good to have in mind).
    *   **What can it do?** Chatbots (like me!), writing, summarizing, translating, coding, answering questions, creative writing.
    *   **Why is it important?** Revolutionizing tech, human-computer interaction, productivity tool, new possibilities, ethical concerns (bias, misinformation).
    *   **Analogies:** "Supercharged autocomplete," "linguistic chameleon," "statistical parrot" (maybe too negative for an intro), "pattern-matching engine." The "supercharged autocomplete" is a great, simple analogy.

3.  **Structure the Introduction:** I'll use a classic "funnel" approach.
    *   **Hook/Simple Definition:** Start with a very simple, relatable definition. What is the absolute core idea?
    *   **The "How" (Simplified):** Explain the training process without getting bogged down in technical details. Focus on the *scale* (massive data, huge size).
    *   **The "What" (Capabilities):** List concrete examples of what they can do. This makes it real for the user.
    *   **The "Why" (Significance):** Briefly touch on why this is a big deal.
    *   **Concluding Thought:** A short, forward-looking statement.

4.  **Drafting - Version 1 (Mental or rough notes):**
    *   LLMs are AI for language.

"""

Generate the model

import torch
from auto_round import AutoRound
from auto_round.utils import llm_load_model

model_name = "zai-org/GLM-4.6"
# Load the original model and tokenizer on CPU.
model, tokenizer = llm_load_model(model_name, device="cpu")

# Per-layer bit widths: routed expert layers use 4 bits, all other
# Linear layers use 8 bits, and lm_head gets no entry.
layer_config = {}
for n, m in model.named_modules():
    if isinstance(m, torch.nn.Linear):
        if "expert" in n and "shared_experts" not in n:
            layer_config[n] = {"bits": 4}
            print(n, 4)
        elif n != "lm_head":
            layer_config[n] = {"bits": 8}
            print(n, 8)

# iters=0 selects AutoRound's RTN mode.
autoround = AutoRound(model, tokenizer, iters=0, layer_config=layer_config)
autoround.quantize_and_save(format="auto_round", output_dir="tmp_autoround")
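The bit-assignment rule in the loop above can be checked without loading the model. The sketch below restates it as a standalone function and applies it to a few module names. The names are illustrative assumptions, not taken from the real GLM-4.6 module tree, and `bits_for_layer` is a hypothetical helper.

```python
def bits_for_layer(name):
    """Mirror the name-based rule used to build layer_config.

    Returns 4 for routed expert layers, 8 for other Linear layers,
    and None for lm_head, which gets no layer_config entry.
    """
    if "expert" in name and "shared_experts" not in name:
        return 4
    if name != "lm_head":
        return 8
    return None


# Hypothetical Linear-layer names, for illustration only.
example_names = [
    "model.layers.3.mlp.experts.0.gate_proj",
    "model.layers.3.mlp.shared_experts.up_proj",
    "model.layers.3.self_attn.q_proj",
    "lm_head",
]

layer_config = {}
for name in example_names:
    bits = bits_for_layer(name)
    if bits is not None:
        layer_config[name] = {"bits": bits}
    print(name, bits)
```

Note that shared experts contain the substring "expert" but are explicitly excluded from the 4-bit branch, so they fall back to 8 bits along with the attention layers.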

Ethical Considerations and Limitations

The model can produce factually incorrect output, and should not be relied on to produce factually accurate information. Because of the limitations of the pretrained model and the finetuning datasets, it is possible that this model could generate lewd, biased or otherwise offensive outputs.

Therefore, before deploying any applications of the model, developers should perform safety testing.

Caveats and Recommendations

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.

Here is a useful link to learn more about Intel's AI software:

Intel Neural Compressor

Disclaimer

The license on this model does not constitute legal advice. We are not responsible for the actions of third parties who use this model. Please consult an attorney before using this model for commercial purposes.

Cite

@article{cheng2023optimize,
  title={Optimize weight rounding via signed gradient descent for the quantization of llms},
  author={Cheng, Wenhua and Zhang, Weiwei and Shen, Haihao and Cai, Yiyang and He, Xin and Lv, Kaokao and Liu, Yi},
  journal={arXiv preprint arXiv:2309.05516},
  year={2023}
}

