Is there a library for Python that gives the script name for a given unicode character or string? -
is there library tells script particular unicode character belongs to?
for example input "u'ሕ'" should return ethiopic or similar.
you can parse scripts.txt
file:
# -*- coding: utf-8; -*- import bisect script_file = "/path/to/scripts.txt" scripts = [] open(script_file, "rt") stream: line in stream: line = line.split("#", 1)[0].strip() if line: rng, script = line.split(";", 1) elems = rng.split("..", 1) start = int(elems[0], 16) if len(elems) == 2: stop = int(elems[1], 16) else: stop = start scripts.append((start, stop, script.lstrip())) scripts.sort() indices = [elem[0] elem in scripts] def find_script(char): if not isinstance(char, int): char = ord(char) index = bisect.bisect(indices, char) - 1 start, stop, script = scripts[index] if start <= char <= stop: return script else: return "unknown" print find_script(u'a') print find_script(u'Д') print find_script(u'ሕ') print find_script(0x1000) print find_script(0xe007f) print find_script(0xe0080)
note code neither robust nor optimized. should test whether argument denotes valid character or code point, , should coalesce adjacent equivalent ranges.
Comments
Post a Comment