之前鬼使神猜地选了一门新闻学院的创新科研项目,主要是研究普通话语料库的,最后果然是沦落到写网页的地步,这里稍作记录:
因为语料库的数据库是基于小组成员之前所听写的普通话测试语料,文本编码格式不一,直接导入MariaDB必然会乱码,所以首先把所有的文本转换成UTF-8编码:
1 |
find . -type f -exec sh -c "enconv -L zh_CN -x UTF-8 {}" \; |
然后创建数据库:
1 |
create database news; |
创建表:
1 2 3 4 5 6 7 8 9 10 |
DROP TABLE IF EXISTS Source; CREATE TABLE Source ( file_id BIGINT NOT NULL AUTO_INCREMENT, PRIMARY KEY (file_id), filename VARCHAR(40) NOT NULL, linenumber INT NOT NULL, sentence LONGTEXT NOT NULL ); |
接着就是处理txt文件了,txt文件中是由一行行的文本组成,需要把每个txt文件改成sql语句的格式方便直接source导入数据库。这种逼格满满的事情自然而然是应该用sed awk之流或者Python,PHP什么的来写,然而这些脚本语言我实在是不会。。。所以我最后用了C++。。。再low也得认了。。。囧rz:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 |
#include <iostream> using namespace std; #include <string> #include <stdio.h> int main(int argc, char ** argv) { freopen(argv[1], "r", stdin); string tmp = argv[1]; string filename =""; for( int i = tmp.length()-1; tmp[i] != '/'; i-- ) filename = tmp[i]+filename; tmp = "/home/razrlele/news/sql/" + filename+".sql"; freopen(tmp.c_str(), "w", stdout); string in; cout << "INSERT INTO Source(filename, linenumber, sentence) VALUES" << endl; int linenum = 1; while(getline(cin , in)){ if( linenum > 1 ) cout << "," << endl; cout << "(\'"<< filename << "\',"<<linenum++<< ",\'" << in <<"\'" <<")"; } cout << ";" << endl; return 0; } |
结合shell script:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
#!/bin/bash read dir cd $dir for i in `find -type f` do ~/news/transform $i done cd ~/news/sql for i in `ls` do echo source ~/news/sql/$i >> ~/news/sql/out.sql done |
在source进数据库之前要把 character_set_filesystem 设置成UTF-8:
1 |
ALTER DATABASE news CHARACTER SET utf8 COLLATE utf8_general_ci; |
最后就是搭页面了。。。
目测最快的解决方案就是PHP了。。。
先从这里找到了一些源码,然后根据自己需要删改了一下,期间先是踩了Arch装Nginx的坑(目测是受之前安装的影响,不停地403,也不知道是哪个配置有了问题),后来干脆放弃直接在服务器上写,然后晚上edu网络各种爆炸,还好用的是mosh,断线切网再重连毫无压力。接着又踩了一个数据库的坑,大概就是忘记把query转成UTF-8,搞得query死都不能执行成功,各种debug最后算是发现了问题,虽然最终的检索依旧比较残废,但那应该是MySql中文检索不行的锅,以后有机会再优化吧。
首页模板:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 |
<head> <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> <title>NCU Chinese Library</title> <link href='https://fonts.yueyu.io/css?family=Droid+Serif' rel='stylesheet' type='text/css'> <style type="text/css"> body { text-align: center; color: #0f0; font-family: 'Droid Serif','Open Sans' ,'YuMincho','Adobe Ming Std','LiSong Pro Light','Songti TC','PMingLiU',serif; font-size: 20px; background: #000;} table ,th ,td{ border:1px solid #0ff;} a{color: #0ff; text-decoration:none;} a:hover{color: #fff; background:#0ff;} .indexPage a{color:#0f0; } .searchArea{width:600px; border:2px solid #0ff; margin-bottom:20px; margin-right:auto; margin-left:auto; text-align: center;} .footer {color: #0f0; text-align:center;} </style> </head> <body> <div class="indexPage"> <a href="http://cl.yueyu.io"><p style="font-size:70px;">南昌大学中文语料库</p></a> </div> <div class="searchArea"> |
index页:
1 2 3 4 5 6 7 8 9 10 11 |
<?php include_once('header.php'); ?> <p>输入关键字搜索</p> <form action="search.php" method="post"> <input name="keyword" size="50"> <input type="submit" value="Search"> </form> <?php include_once('footer.php'); ?> |
数据库链接:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 |
<?php # For database connection function db_connect(){ try { $result = mysql_connect('localhost:3306','username','password'); mysql_query("set names 'utf8' ");//就 mysql_query("set character_set_client=utf8");//是 mysql_query("set character_set_results=utf8");//这里!!(给query调成UTF-8 mysql_select_db('news', $result); } catch (Exception $e) { echo $e->message; exit; } if (!$result) { return false; } return $result; } ?> |
检索处理:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 |
<?php header("Content-Type: text/html; charset=utf-8"); include_once('db_fns.php'); include_once('header.php'); $handle = db_connect(); if ($_REQUEST['keyword']){ $keywords = explode(' ', $_REQUEST['keyword']); $num_keywords = count($keywords); for ($i = 0; $i<$num_keywords; $i++){ if($i){ $keywords_string .= "or ".$keywords[$i].""; }else{ $keywords_string .= $keywords[$i]; } } $query = " select file_id,linenumber, sentence from Source where sentence like BINARY '%".$keywords_string."%';" ; $result = mysql_query($query); } echo '<p>搜索 '; echo "<strong>".$keywords_string."</strong>"; echo ' 的结果</p>'; echo '<table>'; if($result && mysql_num_rows($result) ){ echo "<tr>"; echo " <td style=\"width:100px;text-align:center; \"> 语料ID</td>"; echo " <td style=\"width:100px;text-align:center; \">语句所在行数</td>"; echo " <td style=\"width:400px;text-align:center; \">语句</td>"; echo '</tr>'; while($matches = mysql_fetch_array($result)){ echo "<tr>"; echo " <td style=\"width:100px;\"> ".$matches['file_id']."</td>"; echo " <td style=\"width:100px;\">".$matches['linenumber']."</td>"; echo " <td style=\"width:400px;\">".$matches['sentence']."</td>"; echo '</tr>'; } }else{ echo '无匹配'; } echo '</table>'; include_once('footer.php'); ?> |
脚注:
1 2 3 4 5 |
</div> <div class='footer'> Powered by <a href="http://blog.razrlele.com">razrlele</a> </div> </body> |
算是第一次真正接触PHP,第一感受就是简单,上手确实比较快,也是第一次接触整个建站过程(虽然都是DEMO级别),感觉还是有点收获,要是能在大一大二的时候有这样一次经历我觉得是更好。