之前鬼使神猜地选了一门新闻学院的创新科研项目,主要是研究普通话语料库的,最后果然是沦落到写网页的地步,这里稍作记录:
因为语料库的数据库是基于小组成员之前所听写的普通话测试语料,文本编码格式不一,直接导入MariaDB必然会乱码,所以首先把所有的文本转换成UTF-8编码:
| 1 | find . -type f -exec sh -c "enconv -L zh_CN -x UTF-8 {}" \; | 
然后创建数据库:
| 1 | create database news; | 
创建表:
| 1 2 3 4 5 6 7 8 9 10 | DROP TABLE  IF EXISTS Source; CREATE TABLE Source (      file_id BIGINT  NOT NULL AUTO_INCREMENT,      PRIMARY KEY (file_id),       filename VARCHAR(40) NOT NULL,      linenumber INT NOT NULL,      sentence LONGTEXT NOT NULL ); | 
接着就是处理txt文件了,txt文件中是由一行行的文本组成,需要把每个txt文件改成sql语句的格式方便直接source导入数据库。这种逼格满满的事情自然而然是应该用sed awk之流或者Python,PHP什么的来写,然而这些脚本语言我实在是不会。。。所以我最后用了C++。。。再low也得认了。。。囧rz:
| 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 | #include <iostream> using namespace std; #include <string> #include <stdio.h> int main(int argc, char ** argv) {     freopen(argv[1], "r", stdin);     string tmp = argv[1];     string filename ="";     for( int i = tmp.length()-1; tmp[i] != '/'; i-- )         filename  = tmp[i]+filename;     tmp = "/home/razrlele/news/sql/" + filename+".sql";     freopen(tmp.c_str(), "w", stdout);     string in;     cout << "INSERT INTO Source(filename, linenumber, sentence) VALUES" << endl;     int linenum = 1;     while(getline(cin , in)){         if( linenum > 1 )             cout << "," << endl;          cout << "(\'"<< filename << "\',"<<linenum++<< ",\'" << in <<"\'" <<")";     }     cout << ";" << endl;     return 0; } | 
结合shell script:
| 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 | #!/bin/bash read dir cd $dir for i in `find -type f` do     ~/news/transform $i done cd ~/news/sql for i in `ls` do     echo source ~/news/sql/$i >> ~/news/sql/out.sql done | 
在source进数据库之前要把 character_set_filesystem 设置成UTF-8:
| 1 | ALTER DATABASE news CHARACTER SET utf8 COLLATE utf8_general_ci; | 
最后就是搭页面了。。。
目测最快的解决方案就是PHP了。。。
先从这里找到了一些源码,然后根据自己需要删改了一下,期间先是踩了Arch装Nginx的坑(目测是受之前安装的影响,不停地403,也不知道是哪个配置有了问题),后来干脆放弃直接在服务器上写,然后晚上edu网络各种爆炸,还好用的是mosh,断线切网再重连毫无压力。接着又踩了一个数据库的坑,大概就是忘记把query转成UTF-8,搞得query死都不能执行成功,各种debug最后算是发现了问题,虽然最终的检索依旧比较残废,但那应该是MySql中文检索不行的锅,以后有机会再优化吧。
首页模板:
| 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 | <head> <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> 	<title>NCU Chinese Library</title>     <link href='https://fonts.yueyu.io/css?family=Droid+Serif' rel='stylesheet' type='text/css'> 	<style type="text/css">         body {         text-align: center;         color: #0f0;         font-family: 'Droid Serif','Open Sans' ,'YuMincho','Adobe Ming Std','LiSong Pro Light','Songti TC','PMingLiU',serif;         font-size: 20px;         background: #000;}         table ,th ,td{  border:1px solid #0ff;}         a{color: #0ff; text-decoration:none;}         a:hover{color: #fff; background:#0ff;}         .indexPage a{color:#0f0; }         .searchArea{width:600px;  border:2px solid #0ff;         margin-bottom:20px;         margin-right:auto; margin-left:auto; text-align: center;} 		.footer {color: #0f0; text-align:center;} 	</style> </head> <body>         <div class="indexPage"> 			<a href="http://cl.yueyu.io"><p style="font-size:70px;">南昌大学中文语料库</p></a>         </div>         <div class="searchArea"> | 
index页:
| 1 2 3 4 5 6 7 8 9 10 11 | <?php     include_once('header.php'); ?> <p>输入关键字搜索</p> <form action="search.php" method="post">     <input name="keyword" size="50">     <input type="submit" value="Search"> </form> <?php     include_once('footer.php'); ?> | 
数据库链接:
| 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 | <?php # For database connection function db_connect(){ 	try {         $result = mysql_connect('localhost:3306','username','password');          mysql_query("set names 'utf8' ");//就          mysql_query("set character_set_client=utf8");//是          mysql_query("set character_set_results=utf8");//这里!!(给query调成UTF-8 	 mysql_select_db('news', $result); 	} catch (Exception $e) { 		echo $e->message; 		exit; 	} 	if (!$result) { 		return false; 	} 	return $result; } ?> | 
检索处理:
| 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 | <?php header("Content-Type: text/html; charset=utf-8");     include_once('db_fns.php');     include_once('header.php');     $handle = db_connect();     if ($_REQUEST['keyword']){         $keywords = explode(' ', $_REQUEST['keyword']);         $num_keywords = count($keywords);         for ($i = 0; $i<$num_keywords; $i++){             if($i){                 $keywords_string .= "or  ".$keywords[$i]."";             }else{                 $keywords_string .= $keywords[$i];             }         }         $query = " select file_id,linenumber, sentence                          from Source                          where  sentence like BINARY '%".$keywords_string."%';" ;         $result = mysql_query($query);     }     echo '<p>搜索  ';     echo "<strong>".$keywords_string."</strong>";     echo '   的结果</p>';         echo '<table>';     if($result && mysql_num_rows($result) ){             echo "<tr>";             echo " <td style=\"width:100px;text-align:center; \"> 语料ID</td>";             echo " <td style=\"width:100px;text-align:center; \">语句所在行数</td>";             echo " <td style=\"width:400px;text-align:center; \">语句</td>";             echo '</tr>';         while($matches = mysql_fetch_array($result)){             echo "<tr>";             echo " <td style=\"width:100px;\"> ".$matches['file_id']."</td>";             echo " <td style=\"width:100px;\">".$matches['linenumber']."</td>";             echo " <td style=\"width:400px;\">".$matches['sentence']."</td>";             echo '</tr>';         }     }else{         echo '无匹配';     }      echo '</table>';     include_once('footer.php'); ?> | 
脚注:
| 1 2 3 4 5 |         </div> 		<div class='footer'> 		 Powered by <a href="http://blog.razrlele.com">razrlele</a> 		</div> </body> | 
算是第一次真正接触PHP,第一感受就是简单,上手确实比较快,也是第一次接触整个建站过程(虽然都是DEMO级别),感觉还是有点收获,要是能在大一大二的时候有这样一次经历我觉得是更好。