Gemma 3 on mobile and web with Google AI Edge

Gemma 3 1B is a new model size in the Gemma family of open weight models that truly opens up the possibility of distributing in-app small language models (SLMs) across mobile and web. When deploying SLMs in production settings, models need to be small enough to download quickly, run fast enough to hold user attention, and support a wide range of end user devices.

At only 529MB in size, Gemma 3 1B runs at up to 2585 tokens per second on prefill via Google AI Edge's LLM inference, making it possible to process a page of content in under a second. By including Gemma 3 1B in your app, you can use natural language to drive your application or generate content from in-app data or context, all fully customizable and fine-tunable.

In this post, we'll guide you through some example use cases for Gemma 3 in your application, how to get started with Gemma on Android, some of the performance metrics, and how all of this was achieved.

Turn app data into personalized content on Android using Gemma 3 1B

What Can I Do With Gemma 3 in My App?

With a fully on-device Gemma 3 1B model, you can take advantage of the benefits of AI Edge:

1. Offline Availability: Enable your app to work fully when WiFi or cellular data is unavailable.

2. Cost: With no cloud bills, enable free or freemium apps.

3. Latency: Some features need to be faster than a server call allows.

4. Privacy: Bring intelligence to data that can't leave the device or is end-to-end encrypted.

Gemma 1B is extremely versatile and can even be fine-tuned for your own domain and use cases. Here are just a few of our favorite use cases for Gemma 1B:

1. Data Captioning: Turn your app data into engaging and shareable descriptions, e.g., Sleep Data -> "You slept well for 7 hours but you stirred awake 5 times between 2am and 4am" (see the sketch after this list).

2. In-Game Dialog: Create NPC dialog based on the current game state.

3. Smart Reply: Provide users with intelligent, conversation-aware suggested responses while messaging.

4. Document Q&A: Use Gemma 3 together with our new AI Edge RAG SDK to ingest long documents and answer user questions.
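As a rough illustration of the data captioning idea, the sketch below turns structured app data into a prompt for MediaPipe's LLM Inference API. The SleepStats class and the prompt wording are hypothetical examples of ours, not part of any SDK:

import com.google.mediapipe.tasks.genai.llminference.LlmInference

// Hypothetical app-side data class; not part of the SDK.
data class SleepStats(val hoursAsleep: Double, val wakeUps: Int, val wakeWindow: String)

fun captionSleepData(llm: LlmInference, stats: SleepStats): String {
    // Turn structured app data into a natural-language prompt.
    val prompt = "Write one friendly sentence summarizing this sleep data: " +
        "hours asleep: ${stats.hoursAsleep}, wake-ups: ${stats.wakeUps}, " +
        "wake-up window: ${stats.wakeWindow}"
    // Runs fully on-device; the data never leaves the phone.
    return llm.generateResponse(prompt)
}

Called with SleepStats(7.0, 5, "2am-4am"), this should produce a caption along the lines of the example above.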

Getting began

Step 1: Load the Demo app

Download Google AI Edge's pre-built demo app from GitHub and push it to your local Android device. For best performance with Gemma 3 1B, we recommend a device with at least 4GB of memory.

$ wget https://github.com/google-ai-edge/mediapipe-samples/releases/download/v0.1.3/llm_inference_v0.1.3-debug.apk
$ adb install llm_inference_v0.1.3-debug.apk

Alternatively, you can follow our instructions to build the app from source.


Step 2: Choose CPU or GPU

The Gemma 3 model file offers great deployment flexibility, running seamlessly on either your device's CPU or mobile GPU. You can choose to run Gemma 3 on CPU or GPU when you first start the app, or switch between models and backends by going back to the model selection dialog.
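If you integrate the LLM Inference API directly instead of using the demo app, the backend is picked in the task options. A minimal Kotlin sketch, assuming the setPreferredBackend option from recent tasks-genai releases and a placeholder model path (check your version's API surface):

import com.google.mediapipe.tasks.genai.llminference.LlmInference

// Placeholder path: point this at wherever you store the downloaded model file.
val options = LlmInference.LlmInferenceOptions.builder()
    .setModelPath("/data/local/tmp/llm/gemma3-1b-it-int4.task")
    .setPreferredBackend(LlmInference.Backend.GPU) // or LlmInference.Backend.CPU
    .setMaxTokens(1024)
    .build()

// `context` is your Android Context.
val llmInference = LlmInference.createFromOptions(context, options)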


Step 3: Download the Model from Hugging Face

On the model selection screen in the demo app, choose your model. The app will direct you to Hugging Face to log in and accept the Gemma terms of use. Gemma 3 1B, quantized at int4, will be downloaded directly from the LiteRT Hugging Face community organization, and will then be optimized once to run on your device (this only takes a few seconds!).


Step 4: Run the Model

Now it's time to put Gemma 3 to work! Under the hood, Gemma 3 is powered by Google AI Edge's LLM Inference API, designed for efficient on-device processing; a minimal code sketch of calling it directly follows the steps below.

You can interact with the model by chatting with it. Or, you can give it other text processing tasks. For example, try the following:

  • Copy a few paragraphs from a blog post (like this one) or an article.
  • Switch over to the LLM Demo app.
  • Paste the copied text into the input box.
  • Type "Create a social media post for this content. Keep it short and sweet. Less than 50 words" and press enter.
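To drive the same flow from your own code, reusing the llmInference instance from the sketch in Step 2, a single blocking call is enough (a minimal sketch; run it off the main thread in a real app):

// The prompt mirrors the demo steps above; the pasted text is a placeholder.
val pastedText = "...a few paragraphs copied from an article..."
val prompt = "Create a social media post for this content. " +
    "Keep it short and sweet. Less than 50 words\n\n" + pastedText

// Blocking call; prefer generateResponseAsync with a result listener
// if you want tokens streamed back as they are decoded.
val response = llmInference.generateResponse(prompt)
println(response)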

Step 5: Customize Gemma 3 (optional)

One of the great things about the Gemma family of open weight models is the fine-tuned versions produced by the modeling community. Follow this Colab to see how you can use your own data to create your own version of Gemma 3 1B, quantize it, and get it running on mobile devices (CPU and GPU) in your own applications!

Performance

Create social media content locally in-browser using Gemma 3 1B

The demo and measurements here are for the Gemma 3 1B model with int4 parameters quantized via quantization-aware training (QAT), which provides significant storage savings and increased decode throughput. The benchmarked Gemma 3 model supports multiple prefill lengths of 32, 128, 512, and 1024, and it uses a context length of 2048.

Measurements were taken on an Android Samsung Galaxy S24 Ultra with the cpufreq governor set to performance.
Observed performance may vary depending on your phone's hardware and current activity level.

Web measurements were taken on a MacBook Pro 2023 (Apple M3 Pro chip).
Observed performance may vary depending on your computer's hardware and current activity level.

Under the hood

The performance results described above were achieved through extensive optimization efforts. These optimizations were designed to work well across open weight models, including Gemma. Here are some key features that significantly boosted performance and enabled new, reusable functionality.

Quantization: Quantization-aware training was applied to Gemma using a 4-bit integer channel-wise scheme on weights to maintain optimal performance, model quality, and size. In addition to weight quantization, we also dynamically quantize the activations to int8 during execution to best utilize CPU capability.
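As an illustrative sketch (the post doesn't spell out the exact scheme, so treat the constants as assumptions): in a symmetric per-channel int4 scheme, each output channel c of a weight matrix W gets its own scale, and weights are rounded into the signed 4-bit range [-8, 7]:

$s_c = \max_i |W_{c,i}| / 7$
$\hat{W}_{c,i} = \mathrm{clamp}(\mathrm{round}(W_{c,i} / s_c),\, -8,\, 7)$
$W_{c,i} \approx s_c \, \hat{W}_{c,i}$

Quantization-aware training runs the forward pass with these rounded weights (typically using a straight-through estimator for gradients), so the model learns to tolerate the rounding error before deployment.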

Updating the KV Cache layouts: The KV cache is used in Transformer-based models to store the key-value pairs from earlier steps so they can be used to generate subsequent tokens. Reads and writes to the KV cache happen frequently, so it's important that these operations are efficient. These operations were optimized by introducing a new KV cache layout to reduce extra transposes and reshapes. This optimization improved latency on Gemma models by roughly 25% for CPU and 20% for GPU. An additional op was also added to update the KV cache in-place on the GPU more performantly.
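The actual work happens inside LiteRT's C++ kernels, but the idea of a decode-friendly layout with in-place appends can be sketched in a few lines of Kotlin (a hypothetical illustration, not the real implementation):

// Hypothetical illustration of a KV cache stored in a decode-friendly
// layout [numHeads, maxSeqLen, headDim], so each new token's keys are
// appended with a contiguous copy instead of a transpose + reshape.
class KvCache(val numHeads: Int, val maxSeqLen: Int, val headDim: Int) {
    private val keys = FloatArray(numHeads * maxSeqLen * headDim)
    var length = 0
        private set

    // In-place append: writes the new token's key vectors directly into
    // the preallocated buffer, one contiguous block per head.
    fun appendKeys(newKeys: FloatArray /* [numHeads * headDim] */) {
        require(length < maxSeqLen) { "KV cache is full" }
        for (h in 0 until numHeads) {
            val dst = (h * maxSeqLen + length) * headDim
            newKeys.copyInto(keys, destinationOffset = dst,
                startIndex = h * headDim, endIndex = (h + 1) * headDim)
        }
        length++
    }
}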

Improved Loading Time: To take full advantage of CPU and GPU processing, we use specialized tensor layouts. Generating these optimized weight layouts can take time, power, and significant memory. During the first model load, the weights are cached on disk in their optimized format, and subsequent loads read from the cache. If tensor layouts are further optimized, the existing cache will automatically be invalidated and the new format will be saved to disk during the next model load.
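The load-path pattern itself is straightforward; here is a hypothetical Kotlin sketch of it (the real LiteRT implementation also has to handle versioning, partial writes, and memory pressure):

import java.io.File

// Hypothetical sketch of the first-load caching pattern described above.
// `layoutVersion` stands in for whatever key invalidates stale caches.
fun loadWeights(modelFile: File, cacheDir: File, layoutVersion: Int): ByteArray {
    val cacheFile = File(cacheDir, "${modelFile.name}.v$layoutVersion.cache")
    if (cacheFile.exists()) {
        // Fast path: subsequent loads read the pre-optimized layout.
        return cacheFile.readBytes()
    }
    // Slow path (first load): build the optimized layout, then persist it.
    val optimized = buildOptimizedLayout(modelFile.readBytes())
    cacheFile.writeBytes(optimized)
    return optimized
}

// Placeholder for the expensive layout transformation.
fun buildOptimizedLayout(raw: ByteArray): ByteArray = raw // stub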

GPU Weight Sharing: The LLM inference process has two phases: prefill and decode. These phases usually use separate resources for their respective models. To dramatically reduce the memory footprint of LLMs, both phases can share the same weights. While this technique isn't entirely new, this is the first time it has been done in an easily reusable way in the LiteRT runtime and GPU delegate. For ops that support this feature, the GPU delegate checks whether the weights are already present in GPU memory and can be reused. In the future, other models will be able to trivially take advantage of this capability.
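Purely as an illustration of that reuse check (the real logic lives in the GPU delegate's internals), a sketch with hypothetical names:

// Hypothetical sketch: the delegate keeps a registry of weight buffers
// already uploaded to GPU memory, keyed by tensor identity, so the
// decode model can reuse the prefill model's uploads instead of
// uploading a second copy.
class GpuBuffer // stand-in for a real GPU buffer handle

class GpuWeightRegistry {
    private val uploaded = HashMap<Long, GpuBuffer>() // tensorId -> buffer

    fun getOrUpload(tensorId: Long, upload: () -> GpuBuffer): GpuBuffer =
        uploaded.getOrPut(tensorId, upload)
}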

What's next

During the development of Gemma 3, we focused on delivering excellent performance while also building reusable infrastructure for open weight models. In 2025, we plan to leverage this work to support a wider set of third-party models. With additional performance optimizations and an emphasis on further reducing memory use, we intend to continue making models more accessible on a wider range of devices. To keep up with the latest developments, set up notifications for ai_edge_torch on GitHub. More to come soon!


Acknowledgements

Advait Jain, Akshat Sharma, Alan Kelly, Andrei Kulik, Byungchul Kim, Chunlei Niu, Chun-nien Chan, Chuo-Ling Chang, Claudio Basile, Cormac Brick, Ekaterina Ignasheva, Eric Yang, Fengwu Yao, Frank Ban, Gerardo Carranza, Grant Jensen, Haoliang Zhang, Henry Wang, Ho Ko, Jae Yoo, Jiuqiang Tang, Juhyun Lee, Jun Jiang, Khanh LeViet, Kris Tonthat, Lin Chen, Lu Wang, Malini P V, Marissa Ikonomidis, Mark Sherwood, Matthew Soulanille, Matthias Grundmann, Mogan Shieh, Mohammadreza Heydary, Na Li, Pauline Sho, Pedro Gonnet, Ping Yu, Pulkit Bhuwalka, Quentin Khan, Ram Iyengar, Raman Sarokin, Rishika Sinha, Rishubh Khurana, Ronghui Zhu, Sachin Kotwani, Sebastian Schmidt, Steven Toribio, Suleman Shahid, T.J. Alumbaugh, Tenghui Zhu, Terry (Woncheol) Heo, Tyler Mullen, Vamsi Manchala, Vitalii Dziuba, Wai Hon Law, Weiyi Wang, Xu Chen, Yishuang Pang, Youchuan Hu, Yu-hui Chen, Zichuan Wei
